Oral Biology and Medicine

Statistical Management of Data in Clinical Research

Joseph L. Fleiss and Albert Kingman

I. INTRODUCTION

We consider a number of prevalent problems in the design and analysis of clinical studies in dental caries and in periodontology: sample size determination, confidence intervals for important measures of treatment effectiveness, parametric vs. nonparametric methods, and the measurement of interexaminer agreement. The procedures we recommend are applicable to studies in either specialty. A problem that is unique to studies in periodontology, whether the appropriate unit of analysis is the patient or the periodontal pocket, is also considered.

II. CARIES TRIALS

In spite of the remarkable reduction in the prevalence of dental caries that has occurred since the 1960s,1 clinical trials of new anticaries agents continue. These trials tend to be of two kinds, one seeking to establish the superiority of the new agent over a standard treatment and the other seeking to establish equivalence. The analyses of and required sample sizes for these two kinds of trials differ. We consider first, in detail, the more familiar study aimed at establishing superiority. Several of the statistical techniques to be presented are relevant to both kinds of trials.

The number of decayed, missing, and filled surfaces (DMFS score) and the number of decayed, missing, and filled teeth (DMFT score) are still the most popular measures of the extent of dental caries within a patient. For simplicity, we restrict attention here to the DMFS score.

A. Sample Size Determination

The most important statistical property of a randomized clinical trial is its power, the probability that, when in fact the new agent is superior, the trial will correctly end with the conclusion that the new agent is significantly superior to the control. Power depends on a great many factors, including precision of measurement, control for important clinical and demographic characteristics that affect response to treatment, and baseline (i.e., pretreatment) DMFS score,2,3 but on none more importantly than the sample size, i.e., the number of patients in each treatment group.4 The determination of the required number of patients to enroll in a clinical trial is straightforward, but it requires that the investigator specify the values of certain statistical parameters. The following formula is for the number of patients per group on whom complete data are available; if the investigator expects a certain fraction of patients to drop out of the trial, the sample size should be increased accordingly. The formula is

$$n = \frac{2\sigma^2 (z_{\alpha/2} + z_\beta)^2}{\delta^2} \tag{1}$$

A two-tailed test should be performed, so that one will declare a difference to be statistically significant either if the new agent is superior to the control or, contrary to expectation, if it is inferior. The Greek letter $\alpha$ represents the significance level of the two-tailed test (the usual value of $\alpha$ is 0.05, but one occasionally sees $\alpha = 0.01$), and the quantity $z_{\alpha/2}$ is the tabulated value cutting off the proportion $\alpha/2$ in the upper tail of the standard normal distribution (symmetrically, $-z_{\alpha/2}$ cuts off the proportion $\alpha/2$ in the lower tail). For example, $z_{\alpha/2} = 1.96$ for $\alpha = 0.05$ and $z_{\alpha/2} = 2.58$ for $\alpha = 0.01$.

The standard deviations of the DMFS scores within the two groups are assumed to be equal, even if the mean scores differ. The quantity $\sigma^2$, the variance, is the square of the assumed common standard deviation. Earlier studies of similar populations of patients or a pilot study in the population of interest may be relied on to provide an estimate of $\sigma^2$.

The two remaining quantities in Formula 1, $z_\beta$ and $\delta^2$, are related. The quantity $\delta$ represents a clinically important difference between the mean DMFS scores of the two groups, and the probability $1 - \beta$, the power, is the likelihood that the significance test will reject the hypothesis of equality when $\delta$ is the underlying difference between means. The quantity $z_\beta$ is the critical normal curve value cutting off the proportion $\beta$ in the upper tail. For example, if the power is set equal to 0.80 (this is the customary value5 for $1 - \beta$ when $\alpha = 0.05$), then $\beta = 0.20$ and $z_\beta = 0.842$. The value of $\delta$ may be obtained from earlier studies with similar products or from a pilot study, or it may represent the minimum difference that others in the field would accept as representing an important effect.

In some references (especially those published over 20 years ago), the sample size formula is given as

$$n = \frac{2\sigma^2 z_{\alpha/2}^2}{\delta^2}$$

This amounts to taking $z_\beta = 0$ in Formula 1, which is the same as setting the power at 50%. Sample sizes derived from this formula should not be considered as acceptable, but rather as absolute minimums.
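Formula 1 and the remark about 50% power can be checked numerically. The following is a minimal sketch, not a validated planning tool: the function names are our own, the normal quantiles come from Python's standard library rather than from printed tables, and the power calculation uses the normal approximation and ignores the negligible chance of significance in the wrong direction. The inputs $\sigma = 3$ and $\delta = 0.35$ are borrowed from the dentifrice example of Section II.B.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, s.d. 1)


def upper_tail_z(p):
    """Return z_p, the normal value cutting off proportion p in the upper tail."""
    return _STD_NORMAL.inv_cdf(1.0 - p)


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Formula 1: unrounded number of patients per group for a two-tailed test."""
    z_alpha = upper_tail_z(alpha / 2.0)
    z_beta = upper_tail_z(1.0 - power)
    return 2.0 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2


def approximate_power(n, sigma, delta, alpha=0.05):
    """Normal-approximation power of the two-tailed test with n patients per group."""
    z_alpha = upper_tail_z(alpha / 2.0)
    standard_error = sigma * sqrt(2.0 / n)
    return _STD_NORMAL.cdf(delta / standard_error - z_alpha)


if __name__ == "__main__":
    n = sample_size_per_group(sigma=3.0, delta=0.35)
    print("patients per group (rounded up):", ceil(n))
    print("power at that n:", round(approximate_power(n, 3.0, 0.35), 3))

    # The older formula (z_beta = 0) yields a sample size with only 50% power.
    n_minimum = 2.0 * 3.0**2 * upper_tail_z(0.025) ** 2 / 0.35**2
    print("power at the 'absolute minimum' n:",
          round(approximate_power(n_minimum, 3.0, 0.35), 3))
```

Because unrounded quantiles are used, the result differs by a few patients from a hand calculation with 1.96 and 0.84; the difference is immaterial for planning purposes.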

J. L. Fleiss earned his A.B., M.S., and Ph.D. from Columbia University, New York, New York. Dr. Fleiss is a Professor and Head of the Division of Biostatistics, Columbia University, School of Public Health, New York, New York. A. Kingman received his A.B. at Calvin College, Grand Rapids, Michigan; his M.S. was earned at Michigan State University, East Lansing; his Ph.D. was received at Colorado State University, Fort Collins. Dr. Kingman is a Statistician with the Epidemiology and Oral Diseases Prevention Program, National Institute of Dental Research, Bethesda, Maryland.

1990


Critical Reviews In

B. Sample Attrition

It is important for the reader to remember that sample attrition will likely occur, so that the number of patients per group who remain in the trial until its conclusion, $n$, will be a fraction of the total number who start, say $N$. For a 3-year study, the relation is

$$N = \frac{n}{(1 - AR)^3} \tag{2}$$

where AR denotes the annual attrition rate. An example will serve to illustrate the importance of designing the study with an anticipation of a certain rate of attrition. Suppose that a hypothetical company decides to increase the concentration of fluoride in its dentifrice by 100% and would like to test whether the new product is superior to its standard dentifrice. The company intends to recruit only 11- to 13-year-old children who have evidence of dental caries (i.e., who have a DMFS score of 1 or greater). The company's investigators believe that such subjects should have, on the average, a 3-year increment of 3.5 DMFS using the standard dentifrice, with a standard deviation of $\sigma = 3$. The investigators have experienced a 15% annual attrition rate in prior dentifrice trials. How many subjects would have to be randomized to each group initially to ensure that, if the percentage reduction in new DMFS is 10% or greater, there should be an 80% probability of finding a statistically significant difference between the new and standard dentifrices? A two-tailed t-test with significance level of 0.05 will be used.

A 10% reduction corresponds to a predicted mean increase in DMFS score of 3.5(1 - 0.10) = 3.15 in the high-fluoride group, so that $\delta = 3.5 - 3.15 = 0.35$. Furthermore, $z_{\alpha/2} = z_{0.025} = 1.96$ and $z_\beta = z_{0.20} = 0.84$, so that the initial sample size per group should be large enough to assure that

$$n = \frac{2 \times 3^2 \times (1.96 + 0.84)^2}{0.35^2} \approx 1150$$

children are available for analysis at the end of 3 years. With an annual attrition rate of 15%, the number of children to be randomized to each group becomes

$$N = \frac{1150}{(1 - 0.15)^3} = \frac{1150}{0.614} \approx 1875$$

If the adjustment for the attrition rate had been ignored (e.g., if no attrition was anticipated), the study would have begun with 1150 subjects per group but would likely have ended 3 years later with approximately 705 subjects per group. The trial would have had a power of just under 60%, instead of the desired power of 80%.

C. Parametric Analyses

Studies of dental caries are typically conducted on children of similar ages and call for children to remain on their randomly determined regimen for between 2 and 4 years.6 The experiences of the children in the two groups are summarized by the means and standard deviations of the increments in DMFS scores, the changes from baseline to final examination, and the two mean increments are compared by the classical independent-sample t-test. Let $n_1$, $\bar{I}_1$, and $s_1$ represent for group 1 the number of children studied, their mean increment, and the standard deviation of their increments, and let $n_2$, $\bar{I}_2$, and $s_2$ represent the same for group 2. (The study is usually designed to have equal sample sizes in the two treatment groups, but the sample sizes usually end up unequal in practice.) The t-test requires for its validity that the two underlying standard deviations be equal, and it employs as an estimate of the assumed common standard deviation the square root of

$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \tag{3}$$

The t-ratio is equal to

$$t = \frac{\bar{I}_1 - \bar{I}_2}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

which simplifies to

$$t = \frac{\bar{I}_1 - \bar{I}_2}{s}\sqrt{\frac{n_1 n_2}{n_1 + n_2}} \tag{4}$$

Significance is declared if the absolute value of $t$ exceeds the critical value $z_{\alpha/2}$ (the sample sizes are usually so large that the theoretically more appropriate critical value of the t-distribution with $n_1 + n_2 - 2$ degrees-of-freedom is virtually identical to the critical value of the normal distribution). Consider as an example the summary results in Table 1 of an unpublished study comparing 3 years of rinsing with a 0.025% NaF mouthrinse against 3 years of rinsing with a control.

Table 1 Summary DMFS Scores for a Randomized Double-Blind Study Comparing a 0.025% NaF Mouthrinse Against a Control

                                   Pretreatment (X)     Increment (I)
Group              Sample size     Mean      S.D.       Mean      S.D.
1 (control)        225             7.50      8.23       3.24      4.26
2 (0.025% NaF)     252             7.39      8.52       2.66      4.29

Note: Average slope of line associating I with X = 0.19. Average correlation between I and X = 0.34.

Volume 1, Issue 1

The value of the average variance in Formula 3 is

$$s^2 = \frac{224 \times 4.26^2 + 251 \times 4.29^2}{224 + 251} = 18.2831$$

so that $s = 4.28$, and the value of the t-ratio in Formula 4 is

$$t = \frac{3.24 - 2.66}{4.28}\sqrt{\frac{225 \times 252}{225 + 252}} = 1.48$$

There is no statistically significant difference between the mean increments for the two mouthrinses. This t-test, while providing a fair and unbiased comparison between the two groups, may be improved upon slightly in terms of its power. The calculation of each child's increment score takes partial but generally not full advantage of the association that exists between the baseline and post-treatment scores.7 The technique known as the analysis of covariance, on the other hand, takes full advantage of any linear association that exists between the initial and final DMFS scores8-10 and is available in virtually all packages of statistical programs for the personal computer. The analysis of covariance produces a new t-ratio, say $t'$, given, to an excellent degree of approximation, by the expression

$$t' = \frac{(\bar{I}_1 - \bar{I}_2) - b(\bar{X}_1 - \bar{X}_2)}{s\sqrt{1 - r^2}}\sqrt{\frac{n_1 n_2}{n_1 + n_2}} \tag{5}$$

$\bar{X}_1$ and $\bar{X}_2$ represent the two groups' baseline means; $b$ represents the average of the two slopes of the straight lines associating the increment with the baseline value, fitted by least squares; and $r^2$ is the square of the average correlation coefficient associating the two values. The investigator should be aware that identical conclusions are reached whether the analysis of covariance uses the increment as the response variable and the baseline value as the covariate or uses the post-treatment DMFS score as the response variable and the baseline value as the covariate. Taking the increment as the response variable is preferred because of its greater familiarity. For the example, the average slope of the line associating the DMFS increment with the baseline DMFS score is $b = 0.19$, and the average correlation coefficient is $r = 0.34$. The value of the t-ratio in Formula 5 becomes

$$t' = \frac{(3.24 - 2.66) - 0.19(7.50 - 7.39)}{4.28\sqrt{1 - 0.34^2}}\sqrt{\frac{225 \times 252}{225 + 252}} = 1.51$$

slightly larger than the value of the ordinary t-ratio that did not completely adjust for baseline differences.

It is popular to report the difference between the new agent (represented by group 2, say) and the control (represented by group 1) as the proportionate reduction in caries increments, say

$$P = \frac{\bar{I}_1 - \bar{I}_2}{\bar{I}_1} \tag{6}$$

For the illustrative example, the proportionate reduction in caries increments due to the active mouthrinse was

$$P = \frac{3.24 - 2.66}{3.24} = 0.18$$

a reduction of nearly 20%. An informative way of summarizing the information available about a population parameter is by means of a confidence interval, a range of values that, with specified confidence level $100(1 - \alpha)\%$, will include the true parameter value. Notice that the confidence level for the confidence interval and the significance level for the significance test are complementary, the usual confidence levels being 95% (corresponding to $\alpha = 0.05$) and 99% (corresponding to $\alpha = 0.01$). An approximate $100(1 - \alpha)\%$ confidence interval for the proportionate reduction in caries increments in the underlying population is11

$$P \pm z_{\alpha/2}\,\frac{s}{\bar{I}_1}\sqrt{\frac{1}{n_2} + \left(\frac{\bar{I}_2}{\bar{I}_1}\right)^2 \frac{1}{n_1}} \tag{7}$$

More exact but more complicated confidence intervals, including those appropriate when the analysis of covariance is employed, are available.12 For the current example, an approximate 95% confidence interval for the underlying proportionate reduction in caries increments is

$$0.18 \pm \frac{1.96 \times 4.28}{3.24}\sqrt{\frac{1}{252} + \left(\frac{2.66}{3.24}\right)^2 \frac{1}{225}}$$

an interval that extends from -0.04 to 0.40. An underlying percentage reduction of 40% is consistent with the data, but so is an underlying percentage increase of 4%. The inclusion of the value 0 in the confidence interval is consistent with the absence of statistical significance by the t-test.

D. Nonparametric Analyses

The above analyses are examples of parametric statistical analyses in which the assumption is implicitly made that the responses are normally distributed. Investigators unwilling to make such an assumption about DMFS scores or their increments may instead rely on any of a number of nonparametric statistical tests.13 The most powerful nonparametric alternative to the independent-sample t-test is the Mann-Whitney-Wilcoxon test,13 which calls for all $n_1 + n_2$ response measurements to be ranked and for the mean of the ranks assigned to the


measurements on the $n_1$ patients in group 1, say $\bar{R}_1$, to be compared to the mean rank for the second group, say $\bar{R}_2$. Significance is declared if the absolute value of

$$z = (\bar{R}_1 - \bar{R}_2)\sqrt{\frac{12\, n_1 n_2}{(n_1 + n_2)^2 (n_1 + n_2 + 1)}} \tag{8}$$

exceeds $z_{\alpha/2}$. While valid for any distribution of caries increments, not just the normal, the Mann-Whitney-Wilcoxon test and other methods based on ranks do not lend themselves easily to such variations on the t-test as analysis of covariance and confidence intervals for important parameters that are so useful in data analysis and summarization. Even though the measurements themselves are not normally distributed, the central limit theorem, together with the sample sizes in the hundreds that are typical in caries trials, assures that the sample means are effectively normal and that the inferences from the parametric procedures are valid.14 Except for caries studies whose sample sizes are unusually small, we recommend parametric procedures for general use.

E. Alternatives to the DMFS Score

Dissatisfaction with the DMFS score has led some investigators to develop rating scales for measuring the severity of dental caries. An important example is Grainger's Severity Index,15,16 which classifies a patient into one of five mutually exclusive ordered categories depending on the teeth and surfaces showing evidence of caries. The statistical method known as ridit analysis17-19 is appropriate for such categorical data: it takes advantage of the natural ordering of the categories but does not require that arbitrary numbers be assigned to them. The method may be implemented as follows, using the results of a randomized placebo-controlled mouthrinse study for illustration (see Table 2).20

Step 1. The relative frequency distribution of the two samples combined is determined; $p_i$ is the proportion of the combined sample classified into category $i$ ($i = 1$ represents the best outcome, $i = 2$ the second best, ..., $i = 5$ the worst). For example, $p_1 = (19 + 12)/(200 + 209) = 31/409 = 0.076$ and $p_5 = 19/409 = 0.046$.

Step 2. The ridit values for the individual categories, say $r_1, \ldots, r_5$, are determined as follows: $r_1 = p_1/2$; $r_2 = p_1 + p_2/2$; $r_3 = (p_1 + p_2) + p_3/2$; ...; $r_5 = (p_1 + p_2 + p_3 + p_4) + p_5/2$. Thus, each category's ridit value is equal to the sum of the proportions in all of the better-ranking categories plus half of the proportion in the given category. For example, $r_3 = (0.076 + 0.421) + 0.333/2 = 0.664$.

Step 3. Calculate $\bar{r}_1$ and $\bar{r}_2$, the mean ridits in the two treatment groups. Thus, $\bar{r}_1 = (19 \times 0.038 + 89 \times 0.287 + \ldots + 7 \times 0.977)/200 = 0.473$ and $\bar{r}_2 = (12 \times 0.038 + 83 \times 0.287 + \ldots + 12 \times 0.977)/209 = 0.528$. The difference $\bar{r}_2 - \bar{r}_1$ between the two mean ridits, plus 1/2, is an estimate of the chances that a randomly selected patient from the actively treated group responds as well as or better than a randomly selected patient from the control group. If the resulting value is equal to 0.5, the conclusion is that the two groups ended up essentially the same. If it is less than 0.5, the conclusion is that the members of the control group tended to respond better than the members of the experimental group. If the value is greater than 0.5, the conclusion is the opposite. Here the value is (0.528 - 0.473) + 0.5 = 0.555, so the probability is estimated to be over 55% that a typical child from the weekly rinse group ends up the same as or better than one from the placebo rinse group.

Step 4. The significance of the difference between the two mean ridits may be tested by comparing the absolute value of

$$Z = (\bar{r}_2 - \bar{r}_1)\sqrt{\frac{12\, n_1 n_2}{n_1 + n_2 + 1}} \tag{9}$$

to $z_{\alpha/2}$. Here,

$$Z = (0.528 - 0.473) \times \sqrt{\frac{12 \times 200 \times 209}{200 + 209 + 1}} = 1.92$$

Table 2 Post-Treatment Distributions of Patients in a Randomized Double-Blind Trial on the Modified Grainger Severity Index

                                         Outcome category
Group                     Sample size    1 (Best)    2        3        4        5 (Worst)
Weekly rinse              200            19          89       63       22       7
Placebo rinse             209            12          83       73       29       12
Overall proportion (p)                   0.076       0.421    0.333    0.125    0.046
Ridit value (r)                          0.038       0.287    0.664    0.893    0.977
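Steps 1 through 4 can be retraced from the counts in Table 2. The sketch below is illustrative only: the function names are ours, and it carries full precision throughout, so the mean ridits and the value of Z may differ in the third decimal from the hand calculation, which uses proportions and ridits rounded to three places. As in Formula 9, no correction for ties is applied.

```python
from math import sqrt

# Counts from Table 2, ordered from category 1 (best) to category 5 (worst).
WEEKLY_RINSE = [19, 89, 63, 22, 7]
PLACEBO_RINSE = [12, 83, 73, 29, 12]


def category_ridits(groups):
    """Steps 1 and 2: pooled category proportions and their ridit values."""
    pooled = [sum(column) for column in zip(*groups)]
    total = sum(pooled)
    proportions = [count / total for count in pooled]
    ridits = []
    cumulative = 0.0  # sum of proportions in all better-ranking categories
    for p in proportions:
        ridits.append(cumulative + p / 2.0)
        cumulative += p
    return proportions, ridits


def mean_ridit(counts, ridits):
    """Step 3: mean ridit of one group, weighting each ridit by its count."""
    return sum(c * r for c, r in zip(counts, ridits)) / sum(counts)


def ridit_z(mean_1, mean_2, n1, n2):
    """Step 4 (Formula 9): critical ratio for the difference in mean ridits."""
    return (mean_2 - mean_1) * sqrt(12.0 * n1 * n2 / (n1 + n2 + 1))


if __name__ == "__main__":
    _, ridits = category_ridits([WEEKLY_RINSE, PLACEBO_RINSE])
    r1 = mean_ridit(WEEKLY_RINSE, ridits)
    r2 = mean_ridit(PLACEBO_RINSE, ridits)
    print("ridits:", [round(r, 3) for r in ridits])
    print("mean ridits:", round(r1, 3), round(r2, 3))
    print("estimated probability:", round(r2 - r1 + 0.5, 3))
    print("Z:", round(ridit_z(r1, r2, sum(WEEKLY_RINSE), sum(PLACEBO_RINSE)), 2))
```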

The value Z = 1.92 just misses being statistically significant at the 0.05 level. When more sensitive but more complicated methods of analysis are applied,20 the difference becomes significant.

F. Trials Intended to Establish Equivalence

The methods of analysis required to validly test whether a new agent is equivalent to an existing, effective one are different from the ones considered in the preceding sections. Specifically, it would generally be a mistake simply to perform a significance test and to conclude that two treatments are equivalent whenever their means do not differ significantly. For one thing, small sample sizes virtually guarantee a failure to find significance. The absence of statistical significance in a study with relatively few patients hardly represents definitive evidence that two treatments are equivalent. For another, it is inherent in the logic of statistical inference that one draws a definitive conclusion from a set of data only when a hypothesis is rejected, not when it fails to be rejected. The challenge is then to rephrase the statistical hypotheses so that their rejection leads validly to the conclusion that the agents are equivalent.

One widely accepted solution to the problem is to relax the definition of equivalence so that two agents are considered to be "effectively equivalent" if the difference between their underlying means is close to zero. It is essential, in order to remove the opportunity for bias, to specify before any data are collected what "close to zero" represents numerically. It should be a difference, say $\Delta$, which is close to the investigators' boundary between clinically important and unimportant differences. If the difference between the two treatments' underlying means was within the interval from $-\Delta$ to $\Delta$, the investigators would conclude that the treatments were effectively equivalent. If the underlying difference was outside of the interval, the conclusion would be that one agent was superior to the other.

The statistical analysis of the data makes explicit use of the prespecified value of $\Delta$ by testing two hypotheses. Hypothesis 1 is that the underlying difference between means is less than $-\Delta$, and hypothesis 2 is that the underlying difference is greater than $\Delta$. Each of these one-sided tests is performed at the significance level $\alpha/2$, and only if both hypotheses are rejected will the conclusion be drawn that the two treatments are effectively equivalent. The decision rule is to conclude effective equivalence if both of the inequalities

$$\bar{I}_1 - \bar{I}_2 < \Delta - z_{\alpha/2}\, s\sqrt{\frac{n_1 + n_2}{n_1 n_2}} \tag{10}$$

and

$$\bar{I}_1 - \bar{I}_2 > -\Delta + z_{\alpha/2}\, s\sqrt{\frac{n_1 + n_2}{n_1 n_2}} \tag{11}$$

are satisfied.
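The two-sided decision rule can be sketched in a few lines. This is an illustration under stated assumptions rather than a definitive implementation: the function name and argument names are ours, the pooled standard deviation is supplied by the caller, and the normal critical value is used in place of the t-distribution, as elsewhere in this section. The inputs are the Table 1 summary values with an equivalence limit of one DMFS surface.

```python
from math import sqrt
from statistics import NormalDist


def equivalence_decision(mean_1, mean_2, s, n1, n2, limit, alpha=0.05):
    """Apply the two one-sided tests, each at level alpha/2.

    Returns (equivalent, (lower, upper)), where (lower, upper) is the
    100(1 - alpha)% confidence interval for the difference between means.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * s * sqrt((n1 + n2) / (n1 * n2))
    difference = mean_1 - mean_2
    rejects_upper = difference < limit - margin    # difference is below +limit
    rejects_lower = difference > -limit + margin   # difference is above -limit
    return rejects_upper and rejects_lower, (difference - margin, difference + margin)


if __name__ == "__main__":
    equivalent, interval = equivalence_decision(
        mean_1=3.24, mean_2=2.66, s=4.28, n1=225, n2=252, limit=1.0
    )
    print("effectively equivalent:", equivalent)
    print("confidence interval:", tuple(round(v, 2) for v in interval))
```

Requiring both inequalities is the same as requiring the returned interval to lie entirely between the negative and positive limits, which is why the function reports the interval alongside the decision.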

The reader will note that the decision rule simplifies to the following rule based on a confidence interval for the difference between the mean increments: the two treatments will be considered effectively equivalent provided the $100(1 - \alpha)\%$ confidence interval

$$(\bar{I}_1 - \bar{I}_2) \pm z_{\alpha/2}\, s\sqrt{\frac{n_1 + n_2}{n_1 n_2}} \tag{12}$$

lies entirely within the interval from $-\Delta$ to $\Delta$. If the confidence interval excludes zero but includes $-\Delta$, the correct inference is that treatment 1 is superior to treatment 2. If it excludes zero but includes $\Delta$, the correct inference is the converse. If the interval includes the value zero as well as either one or both of $-\Delta$ and $\Delta$, no definitive conclusion is possible (other than that the sample sizes were too small). For the data in Table 1, a 95% confidence interval for the underlying difference is

$$(3.24 - 2.66) \pm 1.96 \times 4.28\sqrt{\frac{225 + 252}{225 \times 252}}$$

or the interval from -0.19 to 1.35. Had the outer limit for effective equivalence been specified a priori as a difference of one decayed, missing, or filled surface (i.e., as $\Delta = 1$), the study would not have been able to conclude that the two agents were equivalent.

The sample sizes in the two groups that are required to test for equivalence21 are greater than the sample size per group given by Formula 1. If the two treatments actually have identical means, and if the probability is to be $1 - \beta$ that the confidence interval in Formula 12 lies within the limits $-\Delta$ and $\Delta$, the required sample size per group becomes

$$n = \frac{2\sigma^2 (z_{\alpha/2} + z_{\beta/2})^2}{\Delta^2} \tag{13}$$

Formulas 1 and 13 differ in that the former involves the normal curve critical value cutting off the fraction $\beta$ in the upper tail, whereas the latter involves the normal curve value corresponding to the fraction $\beta/2$. Assume that $\alpha = 0.05$ and $1 - \beta = 0.80$. From Formula 1, the sample size per group required to test whether two means are different is proportional to $(z_{0.025} + z_{0.20})^2 = (1.960 + 0.842)^2 = 7.85$. From Formula 13, the sample size per group required to test whether two means are equivalent is proportional to $(z_{0.025} + z_{0.10})^2 = (1.960 + 1.282)^2 = 10.51$, approximately one third greater.

Suppose, for the current example, that a study was being planned to test whether treatments 1 and 2 were equivalent, with the value $\Delta = 1$ representing the limit of effective equivalence. With $\sigma$ taken to be equal to 4.28, with $\alpha = 0.05$, and with $1 - \beta = 0.80$, the required sample size per group would be

$$n = \frac{2 \times 4.28^2 (1.960 + 1.282)^2}{1^2} = 385$$
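The comparison between Formulas 1 and 13 can be reproduced directly. The sketch below is a minimal illustration with function names of our own choosing; it uses unrounded normal quantiles from the Python standard library, so intermediate values may differ slightly in the last digit from those obtained with tabulated values.

```python
from math import ceil
from statistics import NormalDist


def upper_tail_z(p):
    """Normal value cutting off proportion p in the upper tail."""
    return NormalDist().inv_cdf(1.0 - p)


def superiority_multiplier(alpha=0.05, power=0.80):
    """(z_{alpha/2} + z_beta)^2, the factor appearing in Formula 1."""
    return (upper_tail_z(alpha / 2.0) + upper_tail_z(1.0 - power)) ** 2


def equivalence_multiplier(alpha=0.05, power=0.80):
    """(z_{alpha/2} + z_{beta/2})^2, the factor appearing in Formula 13."""
    return (upper_tail_z(alpha / 2.0) + upper_tail_z((1.0 - power) / 2.0)) ** 2


def equivalence_sample_size(sigma, limit, alpha=0.05, power=0.80):
    """Formula 13: unrounded sample size per group for an equivalence trial."""
    return 2.0 * sigma**2 * equivalence_multiplier(alpha, power) / limit**2


if __name__ == "__main__":
    ratio = equivalence_multiplier() / superiority_multiplier()
    print("equivalence / superiority multiplier:", round(ratio, 2))
    print("n per group:", ceil(equivalence_sample_size(sigma=4.28, limit=1.0)))
```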

Due to the decline in the prevalence of dental caries in the 1970s and 1980s, conducting caries clinical trials to test either superiority or equivalence has become very difficult and expensive. But such trials are not necessarily unimportant because there remain subpopulations that are very susceptible to dental caries. These considerations make it all the more important to use efficient study designs and analytical methods for caries clinical trials. Since differences in mean DMFS scores between study groups will be small (relative to 10 to 15 years ago), the group sizes need to be extremely large in general populations. This will probably necessitate conducting clinical trials in "high risk" populations. Regardless of the characteristics of the study population, there is no substitute for the use of efficient study designs and proper statistical data management in order to reduce cost and effort in conducting clinical trials.
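As a compact recapitulation of the parametric analyses of Section II.C, the following sketch retraces the Table 1 calculations from the summary statistics alone. It is offered as an illustration, not as a substitute for a statistical package: the function names are ours, the confidence-interval expression is the approximate form implied by the worked numbers in the text and should be treated as an assumption, and full precision is carried throughout, so results may differ in the last reported digit from the hand calculations based on s = 4.28.

```python
from math import sqrt
from statistics import NormalDist

# Table 1 summary values (group 1 = control, group 2 = 0.025% NaF mouthrinse).
N1, MEAN_I1, SD_I1, MEAN_X1 = 225, 3.24, 4.26, 7.50
N2, MEAN_I2, SD_I2, MEAN_X2 = 252, 2.66, 4.29, 7.39
SLOPE_B, CORR_R = 0.19, 0.34  # from the note to Table 1


def pooled_sd(n1, s1, n2, s2):
    """Square root of the pooled variance (Formula 3)."""
    return sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))


def t_ratio(mean_1, mean_2, s, n1, n2):
    """Independent-sample t-ratio (Formula 4)."""
    return (mean_1 - mean_2) / s * sqrt(n1 * n2 / (n1 + n2))


def ancova_t_ratio(mean_1, mean_2, base_1, base_2, b, r, s, n1, n2):
    """Approximate analysis-of-covariance t-ratio (Formula 5)."""
    adjusted = (mean_1 - mean_2) - b * (base_1 - base_2)
    return adjusted / (s * sqrt(1.0 - r**2)) * sqrt(n1 * n2 / (n1 + n2))


def proportionate_reduction_ci(mean_1, mean_2, s, n1, n2, alpha=0.05):
    """Proportionate reduction and its approximate confidence limits."""
    reduction = (mean_1 - mean_2) / mean_1
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * (s / mean_1) * sqrt(1.0 / n2 + (mean_2 / mean_1) ** 2 / n1)
    return reduction, reduction - half_width, reduction + half_width


if __name__ == "__main__":
    s = pooled_sd(N1, SD_I1, N2, SD_I2)
    print("pooled s:", round(s, 2))
    print("t:", round(t_ratio(MEAN_I1, MEAN_I2, s, N1, N2), 2))
    print("t':", round(ancova_t_ratio(MEAN_I1, MEAN_I2, MEAN_X1, MEAN_X2,
                                      SLOPE_B, CORR_R, s, N1, N2), 2))
    print("P and 95% limits:",
          tuple(round(v, 2) for v in
                proportionate_reduction_ci(MEAN_I1, MEAN_I2, s, N1, N2)))
```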

III. TRIALS IN PERIODONTITIS

An increasing amount of attention has been given in recent years to methodological issues in the design and analysis of clinical studies in periodontitis. Because clinical research in periodontitis does not yet have the long history that clinical research in caries does, statistical traditions comparable to analyzing whole-mouth DMFS scores have not yet evolved. In fact, controversy has existed concerning the appropriate clinical measure of disease activity, change in pocket depth vs. change in attachment level, and concerning the appropriate unit of statistical analysis, the individual site vs. the individual patient. We summarize these controversies and offer our opinions as to how they should be resolved.

A. Clinical Measures

The two most widely used measures in clinical research in periodontology are probing pocket depth and attachment level. Both rely on the use of a periodontal probe graduated by half or by whole millimeters, and both pertain to the loss of periodontal tissue below the cemento-enamel junction (CEJ). They differ in how the measurement is defined. Probing pocket depth is measured from the free gingival margin to the base of the pocket, whereas attachment level is measured from the CEJ to the base of the pocket.

The problem of which measure to use is not a trivial one. It is a widely accepted principle in clinical trials that the investigators should identify, if possible, one single response variable whose analysis will answer the study's primary question.22 One alternative is to consider two or more variables as being equally important and to run the risk of uncertain and confused conclusions if some give statistically significant results and some do not. A second alternative is to identify as the primary response variable the one that was found to produce


the most significant result. This alternative cannot be considered acceptable in clinical research. Ideally, the choice between pocket depth and attachment level as the primary response variable for a future study would be made on the basis of how they compared in previous studies in which both were measured: the one that tended to give larger values for the t-ratio comparing two treatment groups would be the logical choice for a new study. We are not aware, however, of many published studies in which both variables were measured and the separate results reported. Until the results of a large number of such studies become available, we suggest another criterion, that of reliability. The variable that is more reliable, in the sense of the measurements on it being more reproducible, seems to be the better candidate for being the primary one. According to the values in two different studies of a widely used statistical measure of reliability to be described later, the intraclass correlation coefficient, attachment level appears to be more reliable than probing pocket depth. In one study,23-24 reliability was assessed on the basis of repeated measurements on individual sites for 34 patients over the course of 1 day. The intraclass correlation coefficient, which varies on a scale from 0 to 1, with 1 representing perfect reliability, averaged a near-perfect 0.94 for attachment level and averaged a discernibly lower 0.84 for probing pocket depth. In the other study,25 analyses that have not yet been published were performed on repeated measurements of whole-mouth means for 80 patients over the course of several years. The value of the coefficient was 0.85 for attachment level and 0.67 for probing pocket depth. Two studies, especially ones that are so different, are hardly sufficient for inferring general conclusions about the relative reliabilities of the two competing measures. 
Until the results of further large-scale reliability studies are reported and a consensus develops, the individual investigator is advised to conduct his or her own small-scale reliability study before embarking on a large-scale clinical trial and to use as the primary response variable in the trial the variable found to be more reliable. The next section presents methods for analyzing data from a popular kind of reliability study. B. Interexaminer Reliability Study Table 3 presents results from a hypothetical reliability study in which each of three examiners independently measured each of a sample of 10 patients for attachment level. The order in which the examiners conducted their examinations was at random. Measurements were made at six sites on each extant tooth, and the measurements were averaged across all sites to produce the whole-mouth averages in the table. The following statistical model is assumed to represent the factors contributing to a typical whole-mouth average, A:

Volume 1, Issue 1

A = P + X + E

(14)

Oral • V

Biology and Medicine Table 4 Analysis of Variance Table for the Values in Table 3

Table 3 Results of a Hypothetical Interexaminer Reliability Study o1F Whole-Mouth Averages of Attachment Level Examiner Patient

i

2

3

Sum

1 2 3 4 5 6 7 8 9 10

2.4 0.9 1.1 4.6 3.7 5.2 2.0 3.3 4.0 3.4

1.7 0.8 0.6 4.0 3.8 4.5 1.7 2.9 4.1 3.0

2.8 1.3 1.4 4.9 4.2 5.1 2.6 3.5 4.5 3.8

6.9 3.0 3.1 13.5 11.7 14.8 6.3 9.7 12.6 10.2

30.6

27.1

34.1

91.8

Sum

Source of variation

Degrees of freedom

Sum of squares

Mean square

Between Patients Between Examiners Random Error

9

52.7520

5.861

2

2.4500

1.225

18

0.5500

0.031

29

55.7520

Total

aminers is denoted by J (here, J = 3). Total sum of squares (TSS) = i

j

p = l x = l

Note:

i

, i

J

y ZJ y A 2^px - - I(yZJ ZJ y ZJ TT

1J

\

p

= i

x

A M

(16)

= i

Measurements in millimeters.

P represents the contribution made by the patient's underlying "true" score, which may be understood to be the average across a hypothetically great many replicate examinations on him or her; P is assumed to have a mean of $\mu$ and a variance of $\sigma_P^2$ in a population of patients. X represents the contribution made by the particular examiner, which may be understood to be the difference between this examiner's mean and $\mu$; the X's have a mean of 0 and a variance of $\sigma_X^2$ in a population of examiners from which three were selected. E, finally, represents the contribution made by random measurement error; E is assumed to have a mean of 0 and a variance of $\sigma_E^2$. Under the frequently reasonable assumption that the true scores, the examiner effects, and the errors of measurement are uncorrelated, the variance of the whole-mouth average, $\sigma_A^2$, is equal to

$$\sigma_A^2 = \sigma_P^2 + \sigma_X^2 + \sigma_E^2 \tag{15}$$

When a random sample of patients is examined, with each examined by a possibly different randomly selected examiner, the variance of the resulting series of measurements is $\sigma_A^2$, the sum of components due to true patient-to-patient variability, due to differences between examiners, and due to random measurement error. These components of variance may be estimated from the analysis of variance that would be applied to the values obtained in the reliability study. Table 4 presents the results of the analysis of variance applied to the measurements in Table 3. The general formulas for the sums of squares follow. Let $A_{px}$ denote the whole-mouth mean for patient number p when examined by examiner number x. The total number of patients is denoted by I (here, I = 10), and the total number of examiners by J (here, J = 3).

$$\text{Total sum of squares (TSS)} = \sum_{p=1}^{I}\sum_{x=1}^{J} A_{px}^2 - \frac{1}{IJ}\left(\sum_{p=1}^{I}\sum_{x=1}^{J} A_{px}\right)^2 \tag{16}$$

$$\text{Between patients sum of squares (PSS)} = \frac{1}{J}\sum_{p=1}^{I}\left(\sum_{x=1}^{J} A_{px}\right)^2 - \frac{1}{IJ}\left(\sum_{p=1}^{I}\sum_{x=1}^{J} A_{px}\right)^2 \tag{17}$$

$$\text{Between examiners sum of squares (XSS)} = \frac{1}{I}\sum_{x=1}^{J}\left(\sum_{p=1}^{I} A_{px}\right)^2 - \frac{1}{IJ}\left(\sum_{p=1}^{I}\sum_{x=1}^{J} A_{px}\right)^2 \tag{18}$$

and

$$\text{Error sum of squares (ESS)} = \mathrm{TSS} - \mathrm{PSS} - \mathrm{XSS} \tag{19}$$

The reader may check that, for the data under analysis,

$$\mathrm{TSS} = 2.4^2 + 1.7^2 + \cdots + 3.0^2 + 3.8^2 - \frac{1}{10 \times 3}(91.8)^2 = 55.7520,$$

$$\mathrm{PSS} = \frac{1}{3}\left(6.9^2 + 3.0^2 + \cdots + 12.6^2 + 10.2^2\right) - \frac{1}{10 \times 3}(91.8)^2 = 52.7520,$$

$$\mathrm{XSS} = \frac{1}{10}\left(30.6^2 + 27.1^2 + 34.1^2\right) - \frac{1}{10 \times 3}(91.8)^2 = 2.4500,$$

and

$$\mathrm{ESS} = 55.7520 - 52.7520 - 2.4500 = 0.5500.$$

In general, the numbers of degrees of freedom for the sums of squares for patients, examiners, and error are $I - 1$, $J - 1$, and $(I - 1)(J - 1)$; here, they are 10 - 1 = 9, 3 - 1 = 2, and 9 x 2 = 18. The mean squares, upon which the estimated components of variance are based, are equal to the ratios of the sums of squares to their respective numbers of degrees of freedom. The general formulas for the estimated components of variance (the Latin s instead of the Greek $\sigma$ signifies an estimate derived from a data set) are

$$s_P^2 = \frac{\mathrm{PMS} - \mathrm{EMS}}{J} \tag{20}$$

$$s_X^2 = \frac{\mathrm{XMS} - \mathrm{EMS}}{I} \tag{21}$$

and

$$s_E^2 = \mathrm{EMS} \tag{22}$$

where PMS, XMS, and EMS represent the mean squares due to patients, examiners, and error. Here,

$$s_P^2 = \frac{5.861 - 0.031}{3} = 1.94, \qquad s_X^2 = \frac{1.225 - 0.031}{10} = 0.12, \qquad \text{and} \qquad s_E^2 = 0.03.$$

These values exemplify the results that are generally hoped for, a large component of variance for true patient-to-patient variability and much smaller ones for interexaminer variability and random error. A single quantity that summarizes the relative magnitudes of the estimated components of variance is the intraclass correlation coefficient,

$$R = \frac{s_P^2}{s_P^2 + s_X^2 + s_E^2} \tag{23}$$
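The arithmetic leading from the sums of squares to the estimated components of variance and to R can be sketched in a few lines of Python. This is only an illustrative sketch: it takes the three sums of squares quoted in the text as its inputs (the raw measurements of Table 3 are not reproduced here), and the function name and dictionary keys are hypothetical choices, not notation from the article.

```python
def variance_components(pss, xss, ess, n_patients, n_examiners):
    """Estimate variance components and the intraclass correlation R
    from the sums of squares of a patients-by-examiners reliability study.

    Hypothetical helper: names are illustrative, not from the article.
    """
    # Degrees of freedom: I - 1, J - 1, and (I - 1)(J - 1)
    df_patients = n_patients - 1
    df_examiners = n_examiners - 1
    df_error = df_patients * df_examiners

    # Mean squares are sums of squares divided by their degrees of freedom
    pms = pss / df_patients
    xms = xss / df_examiners
    ems = ess / df_error

    # Estimated components of variance, Equations 20 to 22
    s2_patients = (pms - ems) / n_examiners
    s2_examiners = (xms - ems) / n_patients
    s2_error = ems

    # Intraclass correlation coefficient, Equation 23
    icc = s2_patients / (s2_patients + s2_examiners + s2_error)

    return {
        "PMS": pms, "XMS": xms, "EMS": ems,
        "s2_patients": s2_patients,
        "s2_examiners": s2_examiners,
        "s2_error": s2_error,
        "R": icc,
    }


if __name__ == "__main__":
    # Sums of squares quoted in the text, with I = 10 patients and J = 3 examiners
    result = variance_components(52.7520, 2.4500, 0.5500, 10, 3)
    for name, value in result.items():
        print(f"{name}: {value:.3f}")
```

Run on the sums of squares given above, the sketch reproduces, to rounding, the mean squares 5.861, 1.225, and 0.031 and the components of variance 1.94, 0.12, and 0.03, and it gives a value of R of about 0.93.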

Values close to unity signify excellent reliability (in order for R to be close to unity, $s_X^2$ and $s_E^2$ must be small relative to $s_P^2$), and values close to zero signify poor reliability (in order for R to be close to zero, $s_X^2$ or $s_E^2$ must be large relative to $s_P^2$). Here,

$$R = \frac{1.94}{1.94 + 0.12 + 0.03} = 0.93$$

signifying excellent reliability. One may also use the results of the reliability study to estimate the standard error of measurement (SEM), an indicator of the statistical precision with which a single whole-mouth average has been measured:

$$\mathrm{SEM} = \sqrt{s_X^2 + s_E^2} \tag{24}$$

One may be 95% confident, for example, that a patient's true whole-mouth average lies within the interval A ± 1.96 SEM, where A is the measured whole-mouth average. For the data under analysis, $\mathrm{SEM} = \sqrt{0.12 + 0.03} = 0.4$,

so that a 95% confidence interval extends from A - 0.8 to A + 0.8 mm.

Confidence limits may be set for the change in the measured whole-mouth average to be expected over time. Assuming that the errors of measurement made on one occasion are uncorrelated with the errors made on another, and assuming that the examiners might be different on the two occasions, a 95% confidence interval for a patient's measured change, assuming that no change actually occurred, extends from $-1.96 \cdot \mathrm{SEM} \cdot \sqrt{2}$ to $+1.96 \cdot \mathrm{SEM} \cdot \sqrt{2}$. Here, the interval extends from -1.1 to +1.1. If a change between two measurements that falls outside of the interval is taken to represent real clinical change, the chances are 5% that a false-positive error will have occurred, i.e., that change will incorrectly be said to have taken place. If it is the same examiner who makes both measurements, the confidence interval for change narrows to $\pm 1.96\sqrt{\mathrm{EMS}}$. Here the interval for change extends from -0.3 to +0.3. The length of the interval is reduced because interexaminer variability no longer contributes to the measured changes.

Notice that no significance tests were applied to the data. The purpose of a reliability study is to estimate important parameters such as components of variance, the standard error of measurement, and the intraclass correlation coefficient, not to test hypotheses.

The methods presented above are specific to reliability studies in which the same examiners examine every patient in the sample, and in which these examiners may be considered to be a random sample from a larger population of examiners. Other kinds of reliability studies call for different methods of analysis.26-28 Applications of methods such as these to site-specific measurements of probing pocket depth and attachment level exist in the literature.29-32 The next section deals with the appropriateness, or not, of using site-specific measurements in clinical trials of periodontitis.

C. Sites vs. Patients as Units of Analysis

Controversy has existed concerning the proper units of analysis in clinical trials in periodontology, patients or sites.


If it were valid to analyze data at the level of individual sites within patients (this would be the case if the correlations between measurements on different sites within the same mouth were zero32), then an experiment with relatively few patients each measured at many sites would be as informative as (and less expensive than) an experiment with many patients and few sites. If, on the other hand, only analyses at the level of the patient were valid, then experiments with relatively many patients, whether each is measured at few or at many sites, would be required in order to achieve reasonable precision and power. The principles and formulas presented earlier in this article for sample size determination and for statistical inference would then apply.

Several statisticians have demonstrated theoretically that, even if the correlations between sites are small (but not zero), sizable errors will be made if sites are analyzed.33-37 A few statistical analyses bearing on this issue have been performed on actual data sets.24,38,39 Each has demonstrated that the within-mouth correlations between sites are positive in patients with periodontitis. Haffajee et al.38 found that within-mouth correlations for pocket depth and attachment level tended to be lower than for other clinical variables, but were not zero. Fleiss et al.39 found the within-mouth correlations to be positive and, occasionally, of relatively sizable magnitude (correlations of 0.2 or more), both for attachment level measured during a single examination and for change in attachment level over a 9-month period. The most sizable within-mouth correlations were found by Fleiss et al.,24 with values up to 0.5 for pocket depth and up to 0.8 for attachment level.
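To see why even modest within-mouth correlations matter, one can use the standard variance-inflation (design-effect) formula. It is offered here as an illustration and is not a formula taken from this article. The sketch assumes that every pair of sites within a mouth has the same correlation rho and that every site has the same variance; under that assumption, the variance of a mean over m sites per patient is 1 + (m - 1) rho times what it would be if the sites were independent. The function names and the choice of 24 sites per patient are hypothetical; the correlations 0.2, 0.5, and 0.8 are the magnitudes reported above.

```python
def design_effect(sites_per_patient, rho):
    """Variance inflation 1 + (m - 1) * rho for a mean of m equally
    correlated sites (assumed exchangeable-correlation model)."""
    return 1.0 + (sites_per_patient - 1) * rho


def effective_sites(sites_per_patient, rho):
    """Number of independent sites carrying the same information
    as m correlated sites within one mouth."""
    return sites_per_patient / design_effect(sites_per_patient, rho)


if __name__ == "__main__":
    m = 24  # hypothetical number of sites measured per patient
    for rho in (0.0, 0.2, 0.5, 0.8):
        print(
            f"rho = {rho:.1f}: "
            f"variance inflation = {design_effect(m, rho):.1f}, "
            f"effective sites = {effective_sites(m, rho):.2f}"
        )
```

With 24 sites per patient and a within-mouth correlation of 0.2, for example, the sites carry about as much information as 4.3 independent sites, and an analysis that treated them as independent would understate the variance by a factor of 5.6.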
In that study,23 attention was focused on the buccal surfaces of the six key teeth of Ramfjord.40 The examiners' concentrated, unhurried attention to a small number of prespecified sites produced measurements having far greater within-mouth correlations than are found when examiners are required to make measurements at scores of sites within a mouth. The excellent within-mouth correlations found in this study may therefore not be generalizable to those clinical studies in which all teeth and several sites per tooth have to be examined. On the other hand, the low within-mouth correlations found in other studies may not reflect clinical reality, but instead reflect poor reliability brought about by the need to make a great many measurements, possibly hurriedly. Poor reliability, after all, will always tend to diminish the magnitude of correlation coefficients.41

Theory and data both indicate that it is a mistake to employ statistical procedures that take individual sites as the units of analysis; it is the patient who must be the unit of analysis. The problem still remains, however, of how to characterize a given patient's degree of disease at any one time and the patient's change in disease status over time. Much research activity is being conducted to evaluate statistical and measurement procedures proposed by Imrey,35 Laster,37 and Donner and Banting.42 The most informative patient summary measure(s) will

depend on the purpose of the particular study. Whole-mouth averages can always be used, but this may not always be the most informative way to summarize the data. For instance, in a clinical study comparing treatments A and B, it may be difficult using the whole-mouth mean to detect that A does better than B on mesial and distal sites, but worse on buccal sites. In cases where such treatment differences can be hypothesized beforehand, one can resort to a partitioned summary of a patient's data by describing findings for mesial-distal and buccal sites separately. To illustrate, we use data from a hypothetical randomized parallel two-group study design (Table 5).

Table 5. Changes in Attachment Level (Initial Minus Final, in Millimeters) from a Hypothetical Study Comparing Two Treatments

| Measure | Treatment A (n = 25) | Treatment B (n = 26) |
|---|---|---|
| Whole-mouth mean | 1.96 ± 0.65a | 1.76 ± 0.45 |
| Mean for mesial-distal sites | 2.16 ± 0.80 | 1.58 ± 0.53 |
| Mean for buccal sites | 1.56 ± 0.50 | 2.10 ± 0.49 |
| Correlation between mesial-distal and buccal means | 0.63 | 0.42 |

a Mean ± standard deviation.
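Group comparisons based on Table 5 need only the tabulated means, standard deviations, and sample sizes. The sketch below assumes a pooled-variance (equal-variance) two-sample t-test, which has n_A + n_B - 2 = 49 degrees of freedom for these data; the function name `pooled_t` is a hypothetical choice, and p-values are left to a t table or a statistics package.

```python
from math import sqrt


def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample t statistic with a pooled variance estimate,
    computed from summary statistics only.

    Returns the t statistic (A minus B) and its degrees of freedom.
    """
    df = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    # Standard error of the difference between the two group means
    se_diff = sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / se_diff, df


if __name__ == "__main__":
    # Rows of Table 5: (label, mean A, SD A, mean B, SD B)
    rows = [
        ("Whole-mouth mean", 1.96, 0.65, 1.76, 0.45),
        ("Mesial-distal sites", 2.16, 0.80, 1.58, 0.53),
        ("Buccal sites", 1.56, 0.50, 2.10, 0.49),
    ]
    for label, mean_a, sd_a, mean_b, sd_b in rows:
        t, df = pooled_t(mean_a, sd_a, 25, mean_b, sd_b, 26)
        print(f"{label}: t = {t:.2f}, d.f. = {df}")
```

A positive t favors treatment A and a negative t favors treatment B, because the changes in Table 5 are scored as initial minus final attachment level.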

A comparison of the whole-mouth means suggests that these two treatments are similar in effectiveness. A t-test performed on these data confirms that impression (t = 1.28, d.f. = 49, p > 0.2). However, an examination of the site-specific means suggests that the two treatments behave quite differently for different classes of sites. One could test for a difference between groups A and B for the mesial-distal sites and the buccal sites separately, using simple t-tests. For the mesial-distal sites, A is significantly superior to B (t = 3.06, p
