Combining and Comparing Area Estimates across Studies or Strata

DONNA KATZMAN McCLISH, PhD

A method for combining and comparing medical tests across studies or strata is presented. The area under the receiver operating characteristic (ROC) curve is the parameter of interest to be used for comparison. The combined area is a weighted average of the areas under the curve in each study or stratum. A chi-square test for equality of areas across strata can be used to compare the areas. The power of the test is also explored. The methods presented are simple and require only knowledge of estimates of area and their standard errors. Either parametric or nonparametric estimates of the area can be used. Key words: ROC curve; area under the curve; stratification; power. (Med Decis Making 1992;12:274-279)

Received February 20, 1991, from the Department of Biostatistics, Medical College of Virginia, Richmond, Virginia 23298-0032. Revision accepted for publication December 16, 1991. Address correspondence and reprint requests to Dr. McClish.

The evaluation of medical tests has become an important issue in recent years. Receiver operating characteristic (ROC) curves, which display plots of true-positive rates versus false-positive rates, are an important tool in this evaluation. The area under the ROC curve is the most widely used summary of the information contained in the ROC curve. When covariate information is available that may affect the diagnostic accuracy of the test, it is necessary to take this into account. Tosteson and Begg¹ have presented a regression model to directly take covariates into account. A simpler, but seemingly less elegant, approach would be to stratify the data into relatively homogeneous subgroups based on the covariates and analyze within and across subgroups. Comparison among subgroups would allow for determination of differences therein, and the individual areas could be combined to form a single estimate if appropriate.

This subgroup analysis also opens the door to analysis of areas across studies. If area estimates were available from a number of separate studies, or from different populations, these studies or populations could be considered to be subgroups, and their results could be compared and, if appropriate, combined. This would be akin to some of the statistical analysis involved with meta-analysis.²⁻⁴ We present here a simple method of comparing and combining area estimates across subgroups or strata. The only data requirements are independent estimates of the area along with the variances or standard deviations of the estimates. The approach is quite general in that area estimates may be either parametric or nonparametric. The methods presented are essentially the same as those used to compare and combine measures of association,⁵,⁶ but have not heretofore been used with ROC curves.

Methods

Suppose we are interested in evaluating a diagnostic or screening test, and the data can be naturally divided into g independent subgroups or strata. These groups could represent several different evaluation studies, or a stratification of a single study into distinct groups based on a variable or variables thought to impact the accuracy of the test being evaluated (such as age or sex). Whichever is the case, the area under the ROC curve and its variance can be estimated for each subgroup. Let us denote the area by A_i and the variance by Var(A_i). A reasonable estimate of the common area is the weighted mean of the areas in the individual subgroups, where the weights are the reciprocals of the variances, i.e., W_i = 1/Var(A_i). Then

$$\bar{A} = \frac{\sum_{i=1}^{g} W_i A_i}{\sum_{i=1}^{g} W_i} \qquad (1)$$

and

$$\mathrm{Var}(\bar{A}) = \frac{1}{\sum_{i=1}^{g} W_i} \qquad (2)$$

The area under the ROC curve is approximately normally distributed, regardless of whether it is estimated through maximum likelihood for the binormal model⁷ or nonparametrically.⁸ Thus, the estimate of the common area, Ā, will also be normally distributed. A 1 − α% confidence interval for Ā can be constructed with limits

$$\bar{A} \pm Z \sqrt{\mathrm{Var}(\bar{A})} \qquad (3)$$

where Z is the (two-sided) critical value for the standard normal random variable.

Before using a common estimate of the area, it is necessary to test whether there is consistency across subgroups, i.e., whether the areas under the ROC curves are equal. If not, then it is not appropriate to consider Ā a measure of the common area, although it still is an average value. The test for equality of areas is the sum of squared deviations of the individual areas from the mean area, weighted by the variances as above:

$$X^2_{\mathrm{homog}} = \sum_{i=1}^{g} W_i (A_i - \bar{A})^2 \qquad (4)$$

This statistic has a chi-square distribution with g − 1 degrees of freedom under the null hypothesis that all the area measurements come from the same population (i.e., H₀: A₁ = A₂ = ... = A_g).
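For readers who wish to carry out these calculations, the following Python sketch implements equations 1 through 4. It is not part of the original article; the function names are ours, the inputs are illustrative, and numpy and scipy are assumed to be available. The only data required are the per-stratum area estimates and their standard errors.

```python
# Illustrative sketch of equations 1-4 (inverse-variance pooling of ROC areas
# and the chi-square test for homogeneity).  Not the author's original code.
import numpy as np
from scipy import stats

def combine_areas(areas, ses, alpha=0.05):
    """Weighted combination of ROC areas (equations 1-3)."""
    areas = np.asarray(areas, dtype=float)
    weights = 1.0 / np.asarray(ses, dtype=float) ** 2      # W_i = 1 / Var(A_i)
    a_bar = np.sum(weights * areas) / np.sum(weights)       # equation 1
    var_a_bar = 1.0 / np.sum(weights)                       # equation 2
    z = stats.norm.ppf(1.0 - alpha / 2.0)
    half_width = z * np.sqrt(var_a_bar)                     # equation 3
    return a_bar, np.sqrt(var_a_bar), (a_bar - half_width, a_bar + half_width)

def homogeneity_test(areas, ses):
    """Chi-square test that all g areas are equal (equation 4), g - 1 df."""
    areas = np.asarray(areas, dtype=float)
    weights = 1.0 / np.asarray(ses, dtype=float) ** 2
    a_bar = np.sum(weights * areas) / np.sum(weights)
    x2 = np.sum(weights * (areas - a_bar) ** 2)
    df = len(areas) - 1
    return x2, df, stats.chi2.sf(x2, df)

# Hypothetical example: three strata with estimated areas and standard errors.
areas = [0.82, 0.76, 0.88]
ses = [0.04, 0.05, 0.03]
print(combine_areas(areas, ses))
print(homogeneity_test(areas, ses))
```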

One problem with this test for the equality of areas is that it lacks power to detect differences. This lack of power is typical of any general analysis-of-variance type of procedure. In part the poor power is due to the omnibus nature of the statistical test, which has as its alternative hypothesis that at least one of the ROC curve areas is different from some of the others. The test will certainly lack power against specific alternatives such as ROC curve areas that are expected to increase (or decrease) as the variable defining the subgroups (e.g., age) increases. Because of this, whenever the statistical test fails to reject the null hypothesis, a careful examination of the areas and their standard deviations, perhaps graphically, should be made to determine whether it is reasonable to combine estimates for further analysis, or whether systematic biases or other problems appear to characterize the differences among the ROC curve areas. When in doubt, it may be better not to combine the area estimates.

A small simulation study was conducted to explore the power of the test statistic X²_homog. To determine power, it was necessary to specify the areas expected under the alternative hypothesis. Two extreme situations were considered.⁹ Generally, the worst case (i.e., the one in which differences are most difficult to detect) is when the values are arrayed with one area at each extreme and the remainder at the midpoint:

$$A^*_1 = A_{\min}, \qquad A^*_i = \tfrac{1}{2}(A_{\min} + A_{\max}) \ \text{for } 1 < i < g, \qquad A^*_g = A_{\max} \qquad (5)$$

On the other hand, the best case (resulting in the greatest power) is when half the areas are at the minimum value and half are at the maximum value, i.e.,

$$A^*_i = A_{\min} \ \text{for } i \le g/2, \qquad A^*_i = A_{\max} \ \text{for } i > g/2 \qquad (6)$$

The simulation study was performed on a VAX using SAS version 6.06. The parameters studied were g = 3 to 7, A_min = 0.70, 0.75, 0.80, A_max = 0.80, 0.85, 0.90, and n = 50, 100. Once g, n, A*_1 = A_min, and A*_g = A_max were determined, the remaining alternative area values (A*_i, 1 < i < g) were generated either for the best case or for the worst case according to equation 5 or 6. Each run of the simulation consisted of generating g areas, each from a binormal distribution with ROC curve area A*_i. Nonparametric estimates of the areas and their variances were computed using the variance approximation of Hanley and McNeil.¹⁰ The test statistic (equation 4) was then calculated and compared with the critical value of a chi-square with g − 1 degrees of freedom. The proportion of times that the test statistic exceeded the critical value in 500 runs was the estimate of power (details of the simulation are available from the author).

Table 1 has the results of the simulation study for each combination of parameters. For the best-case scenario, the power of X²_homog increases as the number of samples compared (g) increases. In contrast, the power decreases with g for the worst case. Power also increases as A_min increases and, of course, as the difference A_max − A_min increases. The effects of the best- and worst-case scenarios depend on the number of groups being compared (g) as well as the individual sample size (n). The discrepancy between the powers for the best and worst cases widens as the number of groups increases or the sample size decreases.

Table 1 can be used to estimate the power of X²_homog in two ways. If the number of samples and the range of areas to be compared can be specified and are similar to the ranges presented in the table, then the power can be estimated from the appropriate section of the table. The actual power will generally lie somewhere between the powers of the best- and worst-case scenarios. On the other hand, if the areas are very different from those in the table, or if the variances of the area estimates are known to be quite different from the nonparametric estimate used in the simulation, then a different approach, using the columns labeled E, can be used. These columns are somewhat akin to what Cohen⁹ has called "effect size" (actually, E is essentially the noncentrality parameter), and E can be calculated as follows. Suppose the alternative hypothesis is

$$H_a\colon A_i = A^*_i, \qquad i = 1, \ldots, g$$

Then we define the effect size as

$$E = \sum_{i=1}^{g} W_i (A^*_i - \bar{A}^*)^2, \qquad \text{where } \bar{A}^* = \frac{\sum_{i=1}^{g} W_i A^*_i}{\sum_{i=1}^{g} W_i}$$
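Although the article estimated power by Monte Carlo simulation in SAS, treating E as the noncentrality parameter of a chi-square distribution with g − 1 degrees of freedom gives an approximate power without simulation. The sketch below is ours, not the author's program; in particular, the worst-case and best-case patterns coded here reflect our reading of equations 5 and 6, and the standard errors supplied are hypothetical.

```python
# Approximate power of the homogeneity test by treating the effect size E as
# the noncentrality parameter of a chi-square with g - 1 degrees of freedom.
# Illustrative sketch only; the article used a Monte Carlo simulation instead.
import numpy as np
from scipy import stats

def best_case(a_min, a_max, g):
    """Half the areas at the minimum, half at the maximum (our reading of eq. 6)."""
    return np.array([a_min] * (g // 2) + [a_max] * (g - g // 2))

def worst_case(a_min, a_max, g):
    """One area at each extreme, the rest at the midpoint (our reading of eq. 5)."""
    mid = 0.5 * (a_min + a_max)
    return np.array([a_min] + [mid] * (g - 2) + [a_max])

def effect_size(alt_areas, ses):
    """E = sum W_i (A*_i - Abar*)^2 with inverse-variance weights."""
    w = 1.0 / np.asarray(ses, dtype=float) ** 2
    a = np.asarray(alt_areas, dtype=float)
    a_bar = np.sum(w * a) / np.sum(w)
    return np.sum(w * (a - a_bar) ** 2)

def approximate_power(alt_areas, ses, alpha=0.05):
    """P(noncentral chi-square, df = g - 1, ncp = E, exceeds the alpha critical value)."""
    g = len(alt_areas)
    e = effect_size(alt_areas, ses)
    crit = stats.chi2.ppf(1.0 - alpha, g - 1)
    return stats.ncx2.sf(crit, g - 1, e)

# Hypothetical example: g = 5 strata, each area estimated with standard error 0.05.
ses = [0.05] * 5
print(approximate_power(best_case(0.70, 0.85, 5), ses))
print(approximate_power(worst_case(0.70, 0.85, 5), ses))
```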

TABLE 1. Power of Test for Homogeneity, Number of Independent Subgroups (g) = 3, 4, 5, 6, 7
*"Effect size" (see text).

If the calculated value of E can be matched with a similar value in the table, the power of the test should be similar to the corresponding power presented in the table.

Under certain circumstances the power of X²_homog can be improved somewhat. If the groups or strata of interest can be divided a priori into a smaller number of groups, say H groups of sizes g₁, g₂, ..., g_H, where g₁ + g₂ + ... + g_H = g, then one can work with H < g groups, at least some of which will be of larger size. The area for each group is estimated as

$$\bar{A}_h = \frac{\sum_{i \in h} W_i A_i}{\sum_{i \in h} W_i}$$

and the corresponding weight would be determined as

$$W_h = \sum_{i \in h} W_i$$

We test the equality of the H groups as before with the test statistic

$$X^2 = \sum_{h=1}^{H} W_h (\bar{A}_h - \bar{A})^2$$

which has a chi-square distribution with H − 1 degrees of freedom. Of course, we might also be interested in determining whether the H subgroups defined really are themselves homogeneous groups. We can do this by simply treating each of the H subgroups as with equation 4 above to test the null hypothesis that all the g_h ROC curves within group h have the same area, using the statistic

$$X^2_h = \sum_{i \in h} W_i (A_i - \bar{A}_h)^2$$

which has a chi-square distribution with g_h − 1 degrees of freedom.
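A sketch of this two-level analysis follows. The code and names are ours, and the inputs are hypothetical; it assumes only that each stratum carries an a priori group label along with its area and standard error. Areas are pooled within groups exactly as in equation 1, the pooled group areas are compared on H − 1 degrees of freedom, and each group is checked for internal homogeneity on g_h − 1 degrees of freedom.

```python
# Sketch of the partitioned (two-level) comparison: pool strata within a priori
# groups, test between groups (H - 1 df) and within each group (g_h - 1 df).
# Illustrative code, not the author's.
import numpy as np
from scipy import stats

def grouped_tests(areas, ses, labels):
    areas = np.asarray(areas, dtype=float)
    weights = 1.0 / np.asarray(ses, dtype=float) ** 2
    per_group = {}
    group_areas, group_weights = [], []
    for lab in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == lab]
        w, a = weights[idx], areas[idx]
        a_h = np.sum(w * a) / np.sum(w)              # pooled area for group h
        w_h = np.sum(w)                              # weight of group h
        x2_within = np.sum(w * (a - a_h) ** 2)       # within-group homogeneity
        df_within = len(idx) - 1
        per_group[lab] = dict(area=a_h, weight=w_h, x2=x2_within, df=df_within,
                              p=stats.chi2.sf(x2_within, df_within))
        group_areas.append(a_h)
        group_weights.append(w_h)
    ga, gw = np.array(group_areas), np.array(group_weights)
    grand = np.sum(gw * ga) / np.sum(gw)
    x2_between = np.sum(gw * (ga - grand) ** 2)      # between-group test
    df_between = len(ga) - 1
    return per_group, x2_between, df_between, stats.chi2.sf(x2_between, df_between)

# Hypothetical example: five strata partitioned a priori into two groups.
areas = [0.84, 0.80, 0.78, 0.72, 0.70]
ses = [0.03, 0.04, 0.05, 0.04, 0.06]
labels = ["a", "a", "a", "b", "b"]
print(grouped_tests(areas, ses, labels))
```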

Note that there may be differences of opinion as to the appropriate treatment of the analysis as regards the a priori versus post hoc partitioning of the data, and multiple testing. With regard to the former, Fleiss⁵ has suggested that the degrees of freedom for post hoc comparisons (i.e., comparisons suggested by the data as opposed to theory) be increased (to g − 1) to protect against increased type I errors due to data dredging.

Another consideration is the number of statistical tests performed. One must always be concerned with the tradeoff between type I and type II error rates. If not controlled, the type I error rate will increase when multiple tests of hypotheses are performed, and the type II error will increase (i.e., power will decrease) as the type I error decreases or the degrees of freedom increase. Serious consideration must be given to the cost of each error when choosing appropriate analyses. Given possible problems with the power of the test, the methods suggested above (with the given degrees of freedom) should suffice, although caution should be used when interpreting the results if the research actually represents hypothesis generation rather than hypothesis testing.

TABLE 2. Summary of Seven Studies of the Dexamethasone Suppression Test: Areas, Standard Errors, Numbers of Subjects, and Weights
*Area, SE(Area), and sample sizes from table 1 of Mossman and Somoza.¹¹ W_i = 1/[SE(A_i)]². †For complete reference citations, see the reference list.

Results

We present here an example based on an evaluation of the dexamethasone suppression test (DST), a simple laboratory assessment of pituitary-adrenal dysregulation. (The example is intended to indicate the use of the techniques described above, and is not intended to be either a substantive explication of the DST or a meta-analysis of the issue.) It has been suggested that this simple test be used to diagnose and differentiate between various psychiatric disorders, including psychotic depression, schizophrenia, mania, and major depression. A paper by Mossman and Somoza in 1989¹¹ summarized ROC analyses of seven studies of the DST,¹²⁻¹⁸ including the estimate of the area for each curve. The area estimates were obtained using maximum likelihood. We use data from their published results here (table 2) to compare the accuracies of the DST found in these studies and determine whether they appear to represent a common underlying estimate of the area.

We can calculate a mean value of the area estimates in table 2 using equation 1, obtaining Ā = 0.781, with a standard error given by the square root of the result in equation 2. This differs from the "composite" value of 0.792 determined by Mossman and Somoza.¹¹ For the composite they used the (unweighted) arithmetic means of the slope (b_i) and intercept (a_i) parameter estimates from the RSCORE II program¹⁹ to estimate the area as $\Phi(\bar{a}/\sqrt{1 + \bar{b}^2})$. No estimate of variance was given by Mossman and Somoza.

Before interpreting the estimate Ā as estimating a common underlying area, we test whether the seven studies all estimate a single common value. We do this by using equation 4 to test the hypothesis that all seven studies have the same area under the ROC curve. The test statistic is highly significant (p < 0.001) when compared with a chi-square distribution with 7 − 1 = 6 degrees of freedom. Thus, we can conclude that the seven studies have different areas under the ROC curve; they do not indicate equivalent accuracies of using the DST as a diagnostic tool. The value of Ā = 0.781, if it is used at all, should not be considered to represent a common value for the area, but rather a mean value for the areas calculated from the seven studies.

Further analysis might involve partitioning the studies into groups that share commonalities. For example, Mossman and Somoza indicate that two different assay methods, competitive protein binding (CPB) and radioimmunoassay (RIA), were used for the measurement of the DST. (Other a priori comparisons could involve the types of controls used or the times of phlebotomy.)
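Before turning to that partitioned comparison, the overall seven-study comparison just described translates directly into the earlier formulas. In the brief sketch below the areas and standard errors are placeholders only, not the actual table 2 values (which come from Mossman and Somoza¹¹); only the structure of the calculation is intended to carry over.

```python
# Sketch of the overall comparison of the seven DST studies (equations 1-4).
# The areas and standard errors below are PLACEHOLDER values for illustration;
# the actual values are those reported in table 2 (from Mossman and Somoza).
import numpy as np
from scipy import stats

areas = np.array([0.83, 0.80, 0.79, 0.77, 0.76, 0.74, 0.72])  # hypothetical
ses = np.array([0.03, 0.04, 0.05, 0.04, 0.06, 0.08, 0.05])    # hypothetical

w = 1.0 / ses ** 2
a_bar = np.sum(w * areas) / np.sum(w)            # equation 1
se_bar = np.sqrt(1.0 / np.sum(w))                # square root of equation 2
x2 = np.sum(w * (areas - a_bar) ** 2)            # equation 4
p = stats.chi2.sf(x2, len(areas) - 1)            # 7 - 1 = 6 degrees of freedom
print(f"combined area {a_bar:.3f} (SE {se_bar:.3f}); X2 = {x2:.2f}, p = {p:.4g}")
```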


An a priori partitioning of the studies into two groups based on assay method might be considered reasonable. Of the seven studies, only those of Carroll et al.¹³ and Peselow et al.¹⁶ used CPB. Partitioning in this way gives average area estimates of Ā_RIA = 0.805 and Ā_CPB = 0.727. A test of the equality of these two mean areas yields X² = 7.04, which is significant when compared with the critical value of a chi-square with 1 degree of freedom. In addition, a test of the equality of the five areas where RIA was used to assay the DST yields X²_RIA = 28.4, while a test of the equality of the areas from the two ROC curves where CPB was used has a test statistic of X²_CPB = 0.305. We compare these test statistics with critical values of the chi-square distribution with 4 and 1 degrees of freedom (X²₄,.₀₅ = 9.49 and X²₁,.₀₅ = 3.84, respectively). The test of the hypothesis that the two studies using CPB have the same area under the ROC curve was not rejected. Of course, not rejecting the null hypothesis may simply indicate that there is not sufficient power to detect a difference between the two areas. In fact, simulation indicates that the power of the test to compare the two ROC curves is less than 10%. Before deciding whether to combine the two areas, an examination of the actual areas and their standard errors could also be carried out. Figure 1 has the area estimates and standard error bars for all seven DST studies. Looking at this figure and table 2, we see that the DST study by Peselow et al., although not having the smallest sample size, has the largest standard error of all seven studies, almost twice the size of the next largest and four times as large as that of Carroll et al. This may reflect the small number of categories used for the data (four categories as opposed to five to eight for the others) and the distribution among those categories (more than 50% in the largest category).¹⁶ Thus, a lack of power would certainly be a problem in this comparison, with the lack of significance upon testing obscuring the apparent difference between the two area estimates.

Results of the tests comparing the five ROC curves that used RIA indicate that those areas are significantly different. Thus, their comparison (as a group) with the mean of the two CPB curves may not be appropriate. This is borne out by examination of the actual areas, as displayed in figure 1. There it is apparent that while the mean of the areas under the RIA curves may be different from the mean of the CPB curve areas, there is an overlap in the individual areas. Based on this analysis, it would appear that the assay method is not useful for summarizing the diagnostic accuracy of the DST.

FIGURE 1. Areas and standard errors for seven studies of the dexamethasone suppression test.¹²⁻¹⁸ RIA = radioimmunoassay; CPB = competitive protein binding. Labels under the bars indicate the authors of the individual studies. See table 2 and the reference list for complete citations.

Discussion

We have shown how to combine information concerning the areas under the ROC curve from two or more groups using a weighted average. The groups can be defined as strata, partitioning the data from a single study, or they may be independent samples from separate studies. We also showed how to test whether the ROC curve areas were the same for all groups, using a chi-square test, and explored the problem of the power of the test with a small simulation study. The method described here for comparing the areas under more than two ROC curves is much simpler than the one suggested by McClish²⁰ and appears to have similar power. It does not require complete data, only the areas and standard deviations, and it does not matter whether the initial analysis decisions included the binormality assumptions or were nonparametric. In fact, the same method could also be used to compare partial areas across two or more studies or subgroups.²¹

The concept of combining results from separate studies shares similarities with the burgeoning field of meta-analysis.²⁻⁴ In meta-analysis, interest centers on combining information gained in independent studies. A complete meta-analysis is a complex procedure involving a careful literature search to obtain all relevant papers, an assessment of the quality of the studies, statistical pooling procedures, and a sensitivity analysis of the results. The methods suggested here are relevant only to the statistical pooling procedures that might be used in a meta-analysis of ROC curve areas. In addition, meta-analysis usually involves the comparison and combination of the results of significance tests or effect sizes (e.g., results of comparing two treatments), while the work here is relevant to comparing and combining individual area measures, as opposed to the statistical results of area comparisons.

A drawback to this approach of creating and combining strata, as compared with that of Tosteson and Begg,¹ is that it does not take advantage of continuous covariates. While continuous covariates can be categorized, the regression approach is certainly more powerful. In addition, stratification into subgroups that are too small will produce sparse data and poor estimates of the individual areas. Again, the power to compare areas among strata will suffer. The chief advantage of this method is its computational simplicity, requiring only a hand calculator. In addition, since the method requires only knowledge of the area and its standard error, it should be a useful tool to compare areas across studies.

The test for equality of the ROC curve areas may not be as powerful as one would like. The importance of this will vary. If the purpose of the analysis is to combine results from data in a single study that have been stratified into more homogeneous groups to improve the area estimate, it may be sufficient simply to inspect the data to be sure that the areas do not seem too divergent (without formal testing), or to test the data with a conservative value for the significance level (perhaps even smaller than 0.05) to validate statistically that the areas do not differ widely. On the other hand, if the purpose of the analysis is specifically to determine the equality of areas under the ROC curves, then the power of the test is a critical issue that must be considered in the interpretation of the results.

The author thanks the reviewers for their comments, and in particular for the suggestions that led to the inclusion of figure 1 and the power analysis.

References

1. Tosteson AN, Begg CB. A general regression methodology for ROC curve estimation. Med Decis Making. 1988;8:204-15.
2. Hedges LV, Olkin I. Statistical methods for meta-analysis. Orlando, FL: Academic Press, 1985.
3. Greenland S. Quantitative methods in the review of epidemiologic literature. Epidemiol Rev. 1987;9:1-29.
4. L'Abbe KA, Detsky AS, O'Rourke K. Meta-analysis in clinical research. Ann Intern Med. 1987;107:224-33.
5. Fleiss JL. Statistical methods for rates and proportions. New York: John Wiley & Sons, 1981.
6. Breslow NE, Day NE. Statistical methods in cancer research. Lyon, France: IARC Scientific Publications, No. 32, 1980.
7. Dorfman DD, Alf E. Maximum likelihood estimation of parameters of signal detection theory: a direct solution. Psychometrika. 1968;33:117-24.
8. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837-46.
9. Cohen J. Statistical power analysis for the behavioral sciences. New York: Academic Press, 1969.
10. Hanley JA, McNeil BJ. The meaning and use of the area under an ROC curve. Radiology. 1982;143:29-36.
11. Mossman D, Somoza E. Maximizing diagnostic information from the dexamethasone suppression test. Arch Gen Psychiatry. 1989;46:653-60.
12. Aggernaes A, Kirkegaard C, Kron-Meyer I, et al. Dexamethasone suppression test and TRH test in endogenous depression. Acta Psychiatr Scand. 1983;67:258-64.
13. Carroll BJ, Feinberg M, Greden JF, et al. A specific laboratory test for the diagnosis of melancholia: standardization, validation and clinical utility. Arch Gen Psychiatry. 1981;38:15-22.
14. Coppen A, Abou-Saleh M, Milln P, Metcalfe M, Harwood J, Bailey J. Dexamethasone suppression test in depression and other psychiatric illness. Br J Psychiatry. 1982;142:498-504.
15. Mendlewicz J, Charles G, Franckson JM. The dexamethasone suppression test in affective disorder: relationship to clinical and genetic subgroups. Br J Psychiatry. 1982;141:464-70.
16. Peselow ED, Goldring N, Fieve RR, Wright R. The dexamethasone suppression test in depressed outpatients and normal control subjects. Am J Psychiatry. 1982;140:245-7.
17. Schatzberg AF, Rothschild AJ, Stahl JB, et al. The dexamethasone suppression test: identification of subtypes of depression. Am J Psychiatry. 1983;140:88-91.
18. Stokes PE, Stoll PM, Koslow SH, et al. Pretreatment DST and hypothalamic-pituitary adrenocortical function in depressed patients and comparison groups: a multicenter study. Arch Gen Psychiatry. 1984;41:257-64.
19. Swets JA, Pickett RM. Evaluation of diagnostic techniques: methods from signal detection theory. Orlando, FL: Academic Press, 1982.
20. McClish DK. Comparing the areas under more than two independent ROC curves. Med Decis Making. 1987;7:149-55.
21. McClish DK. Analyzing a portion of the ROC curve. Med Decis Making. 1989;9:190-5.
