XML Template (2015) [1.4.2015–12:31pm] [1–16] //blrnas3.glyph.com/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/150021/APPFile/SG-SMMJ150021.3d (SMM) [PREPRINTER stage]

Article

Generalized linear mixed models for multi-reader multi-case studies of diagnostic tests

Statistical Methods in Medical Research 0(0) 1–16 ! The Author(s) 2015 Reprints and permissions: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/0962280215579476 smm.sagepub.com

Wei Liu,1,2 Norberto Pantoja-Galicia,2 Bo Zhang,2 Richard M Kotz,2 Gene Pennello,2 Hui Zhang,3 Jessie Jacob4 and Zhiwei Zhang2

Abstract Diagnostic tests are often compared in multi-reader multi-case (MRMC) studies in which a number of cases (subjects with or without the disease in question) are examined by several readers using all tests to be compared. One of the commonly used methods for analyzing MRMC data is the Obuchowski– Rockette (OR) method, which assumes that the true area under the receiver operating characteristic curve (AUC) for each combination of reader and test follows a linear mixed model with fixed effects for test and random effects for reader and the reader–test interaction. This article proposes generalized linear mixed models which generalize the OR model by incorporating a range-appropriate link function that constrains the true AUCs to the unit interval. The proposed models can be estimated by maximizing a pseudo-likelihood based on the approximate normality of AUC estimates. A Monte Carlo expectationmaximization algorithm can be used to maximize the pseudo-likelihood, and a non-parametric bootstrap procedure can be used for inference. The proposed method is evaluated in a simulation study and applied to an MRMC study of breast cancer detection. Keywords AUC, diagnostic medicine, EM algorithm, pseudo-likelihood, random effect, ROC curve

1 Introduction Diagnostic procedures with continuous or ordinal test results are routinely evaluated using the receiver operating characteristic (ROC) curve.1–3 Summaries of ROC curves such as the area under the curve (AUC) are frequently used to compare different tests. To account for the 1

Department of Mathematics, Harbin Institute of Technology, Harbin, P. R. China Division of Biostatistics, Office of Surveillance and Biometrics, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, MD, USA 3 Department of Biostatistics, St Jude Children’s Research Hospital, Memphis, Tennessee, USA 4 Medical Affairs, GE Healthcare, Milwaukee, Wisconsin, USA 2

Corresponding author: Zhiwei Zhang, Division of Biostatistics, Office of Surveillance and Biometrics, Center for Devices and Radiological Health, Food and Drug Administration, 10903 New Hampshire Avenue, WO66, Room 2220, Silver Spring, MD 20993, USA. Email: [email protected]

Downloaded from smm.sagepub.com at CMU Libraries - library.cmich.edu on September 26, 2015

XML Template (2015) [1.4.2015–12:31pm] [1–16] //blrnas3.glyph.com/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/150021/APPFile/SG-SMMJ150021.3d (SMM) [PREPRINTER stage]

2

Statistical Methods in Medical Research 0(0)

variability between readers, such comparisons are often performed in a multi-reader multi-case (MRMC) setting, where several readers examine many cases (subjects with or without the disease in question) using all tests to be compared. The MRMC design is efficient in that different tests are compared on the same readers and subjects, and informative about not only the overall difference between tests but also the variability between readers. On the other hand, the analysis of MRMC data is challenged by the correlation arising from the same cases being examined by multiple readers using different tests. The available methods for analyzing MRMC data have been reviewed and compared.1,4,5,6 Some of these methods are based on a model for the case-level test result,7–9 which may be a numerical score representing a radiologist’s confidence level about the presence of breast cancer, for instance. Under this approach, the quantities of interest (e.g. AUC) are derived from a regression model for the test result, which may include a fixed effect for test and random effects for reader, case and their interactions with test and with each other. This approach is easy to implement using standard software for correlated data, and potentially efficient in that it works directly with test results. On the other hand, this approach typically requires a large set of modeling assumptions, some of which are not relevant to the scientific question, and the resulting inference may be sensitive to model misspecification. Moreover, the parameters in a model for the test results may not be directly interpretable in terms of diagnostic accuracy. Other methods for MRMC studies work with summary statistics computed using all cases in the sample, such as an AUC estimate for each combination of reader and test.10–13 The methods of Dorfman et al.10 and Beiden et al.12 are based on jackknife pseudo-values, which may be difficult to interpret.1 The method of Gallas13 is based on the average of the reader-specific AUC estimates (for each test), together with a closed-form variance estimate derived from the theory of U-statistics. This approach provides an overall comparison of tests but is limited to that objective (i.e. no estimates of the variability between readers). The Obuchowski–Rockette (OR) model implies that the true AUC for each combination of reader and test follows a linear mixed model (LMM) with random effects for reader and the reader–test interaction.14 Although the OR method has proved useful for comparing tests, it may be inadequate in some situations for describing the variability between readers because the aforementioned LMM allows the true AUCs to take values outside the unit interval. This article aims to generalize the OR model by incorporating a non-linear link function which constrains the true AUCs to the unit interval. The use of a non-linear link has been considered by Song and Zhou15 in the context of marginal AUC regression analysis with fixed reader effects. In MRMC studies, it may be more natural to associate readers with random effects, which help generalize the study data to the population of readers. Therefore, in this article we consider generalized linear mixed models (GLMMs) for MRMC data with a range-appropriate link function, fixed effects for test, and random effects for reader and possibly the reader–test interaction. Such a GLMM cannot be estimated using existing methods because true AUCs are never observed; instead, we work with AUC estimates for different test–reader combinations. The AUC estimates are approximately normally distributed for large numbers of diseased and nondiseased subjects, with a variance matrix that can be consistently estimated. This observation suggests a pseudo-likelihood approach based on the approximate normality of AUC estimates with an estimated variance matrix. The GLMM parameters can be estimated by maximizing the pseudo-likelihood using a Monte Carlo expectation-maximization (MCEM) algorithm, and variance estimates can be obtained by bootstrapping both subjects and readers. The proposed GLMMs and the associated estimation procedures are fully described in Sections 2 and 3, respectively. The method is evaluated in a simulation study and applied to a real example, and the numerical results are presented in Section 4. The article ends with a discussion in Section 5.

Downloaded from smm.sagepub.com at CMU Libraries - library.cmich.edu on September 26, 2015

XML Template (2015) [1.4.2015–12:31pm] [1–16] //blrnas3.glyph.com/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/150021/APPFile/SG-SMMJ150021.3d (SMM) [PREPRINTER stage]

Liu et al.

3

2 Models Consider an MRMC study comparing I tests on the basis of J readers and K subjects, with Yijk denoting the test result for the ith test, the jth reader, and the kth subject. Let Dk indicate the true disease status of the kth subject, so Dk ¼ 1 if the subject is diseased and 0 otherwise. Let Aij be the true AUC for the jth reader using the ith test; formally, this is defined as Aij ¼ EfhðYijk0 , Yijk1 ÞjDk0 ¼ 0, Dk1 ¼ 1g where hð, Þ is a kernel function. For ordinal and semi-continuous test results, it is common to use hð y0 , y1 Þ ¼ Ið y0 5 y1 Þ þ 0:5Ið y0 ¼ y1 Þ

ð1Þ

where IðÞ is the indicator function. For continuous test results, this is equivalent to hð y0 , y1 Þ ¼ Ið y0 5 y1 Þ. Assuming that an appropriate kernel function has been chosen, we consider the following GLMMs for the true AUCs: Aij ¼ gði þ j Þ

ð2Þ

Aij ¼ gði þ j þ ij Þ

ð3Þ

where g is an inverse link function, i is a fixed test effect, and j and  ij are random effects. For the identity link, model (3) is equivalent to the corresponding assumptions in the OR model,14 which allow Aij to take values outside the unit interval. Here, we are most interested in range-appropriate link functions such as probit and logit. Model (2) assumes that g1 ðAij Þ depends on test and reader in an additive fashion; for instance, g1 ðA2j Þ  g1 ðA1j Þ ¼ 2  1 regardless of j. Model (3) allows an interaction between test and reader, so that g1 ðA2j Þ  g1 ðA1j Þ ¼ 2  1 þ 2j  1j may now depend on j. Model (3) is obviously more general than model (2), but the latter is easier to interpret and summarize. Both models can be useful tools, and one of them may be more appropriate than the other in a particular application. For model (2), we assume that the j are independently distributed as Nð0, 2 Þ. This implies that the AUC of test i for a typical reader (in the sense that j ¼ 0) is just gði Þ, and that the mean AUC of test i in the population of all readers is Z Ai ¼ gði þ Þð; 0, 2 Þ d where ð; ,  2 Þ is the density function of the normal distribution with mean  and variance  2 . For the probit link, the above expression simplifies into Ai ¼ fð1 þ 2 Þ1=2 i g, where  is the standard normal distribution function. For model (3), the random effects are assumed to be jointly normal with Eðj Þ ¼ Eðij Þ ¼ 0 varðj Þ ¼ 2 varðij Þ ¼ 2 covðj , ij Þ ¼ covðij , i0 j Þ ¼ 0,

ði 6¼ i0 Þ

Downloaded from smm.sagepub.com at CMU Libraries - library.cmich.edu on September 26, 2015

XML Template (2015) [1.4.2015–12:31pm] [1–16] //blrnas3.glyph.com/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/150021/APPFile/SG-SMMJ150021.3d (SMM) [PREPRINTER stage]

4

Statistical Methods in Medical Research 0(0)

and total independence across j. Under this model, the AUC of test i for a typical reader (in the sense that j þ ij ¼ 0) is again gði Þ, and the mean AUC of test i in the population of all readers is ZZ Z 2 2 Ai ¼ gði þ  þ Þð; 0,  Þð; 0,  Þ d d ¼ gði þ cÞðc; 0, 2 þ 2 Þ dc ð4Þ which becomes Ai ¼ fð1 þ 2 þ 2 Þ1=2 i g for the probit link. To gain some intuition for the proposed models, let us consider the following model for the caselevel test results:   Yijk ¼ mid þ Rjd þ Ckd þ ðmRÞijd þ ðmCÞikd þ ðRCÞjkd þ ijkd ð5Þ where is a strictly increasing function, d ¼ Dk, mid is a fixed effect for test and disease status, Rjd is a random effect for reader and disease status, Ckd is a random effect for subject (or case), ðmRÞijd , ðmCÞikd and ðRCÞjkd are random interactions, and ijkd is a random error. The random variables Ckd, ðmCÞikd , ðRCÞjkd and ijkd are assumed to be normally and independently 2 2 2 2 distributed with common mean 0 and respective variances Cd , mCd , RCd and d . Similar 16 models, with being the identity function, have been considered by Roe and Metz and Song and Zhou.15 For any , it follows from model (5) and the assumed normality that   Aij ¼  fmi1  mi0 þ Rj1  Rj0 þ ðmRÞij1  ðmRÞij0 g=s where s2 ¼

P1

d¼0

2 2 2 2 ðCd þ mCd þ RCd þ d Þ. This corresponds to model (3) with the probit link and

i ¼ ðmi1  mi0 Þ=s j ¼ ðRj1  Rj0 Þ=s ij ¼ fðmRÞij1  ðmRÞij0 g=s Let us assume further that ðRj0 , Rj1 Þ  MVNð0, R2 Cð R ÞÞ 2 Cð mR ÞÞ ððmRÞij0 , ðmRÞij1 Þ  MVNð0, mR

independently of each other and of the other random effects and errors, with MVN denoting the multivariate normal distribution and Cð Þ the compound-symmetric correlation matrix with correlation coefficient . Then j and  ij follow the specified distributions with 2 ¼ 2ð1  R ÞR2 =s2 2 2 ¼ 2ð1  mR ÞmR =s2

If mR ¼ 1 (i.e. ðmRÞij0  ðmRÞij1 ), then 2 ¼ 0 and model (2) holds. Of course, models (2) and (3) can hold true without requiring (5).

3 Parameter estimation Because the Aij are not observed, we work with estimated AUCs, say A^ ij , in estimating the parameters in models (2) and (3). An estimate A^ ij may be obtained using any one of the existing

Downloaded from smm.sagepub.com at CMU Libraries - library.cmich.edu on September 26, 2015

XML Template (2015) [1.4.2015–12:31pm] [1–16] //blrnas3.glyph.com/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/150021/APPFile/SG-SMMJ150021.3d (SMM) [PREPRINTER stage]

Liu et al.

5

methods for estimating the AUC of a given test with a fixed reader. For example, a common choice is the non-parametric estimate X X 1 A^ ij ¼ hðYijk0 , Yijk1 Þ ð6Þ K0 K1 k :D ¼0 k :D ¼1 0

k0

1

k1

P where K1 ¼ K k¼1 Dk and K0 ¼ K  K1 . This is the well-known Wilcoxon–Mann–Whitney statistic, and its asymptotic behavior (K0 , K1 ! 1) follows the theory of U-statistics.17 Alternatively, the bi-normal model or a parametric model for Yijk can be used to derive an estimate of Aij, which is also asymptotically normal. For all of these commonly used estimators, when K0 and K1 are both large, we have the following approximation: b  MVNðA, D Þ A b A

ð7Þ

b ¼ ðA^ ij Þ and A ¼ ðAij Þ. Note that the above approximation is included in the OR model, where A which imposes a simple structure on Db.11,14 Specifically, the OR model assumes that A 8 2 if i ¼ i0 , j ¼ j0  > > > < Cov if i 6¼ i0 , j ¼ j0 1 ð8Þ covðA^ ij , A^ i0 ,j0 Þ ¼ > Cov2 if i ¼ i0 , j 6¼ j0 > > : Cov3 if i 6¼ i0 , j 6¼ j0 b our approach requires an estimate of D , say b In addition to A, Db, which may be structured as above b A A or completely unstructured. For example, the theory of U-statistics can be used to derive an 18 unstructured b Db for the non-parametric estimate (6). . If necessary, an unstructured b Db can be A A modified to satisfy (8), by averaging its elements within each category defined by the relationship between (i, j) and ði0 , j0 Þ. b b The proposed method for parameter estimation is based on A, Db and the approximation (7). To A avoid repetition, we use unified notation for models (2) and (3), with a ¼ ði Þ representing the fixed effects and g representing all the random effects; thus g ¼ ðj Þ for model (2) and g ¼ ðj , ij Þ for model (3). In general, g  MVNð0, Dg Þ. The assumed independence across readers (see Section 2) allows the variance matrix Dg to be modeled parsimoniously, say with parameter x. Specifically, x ¼ 2 for model (2) and x ¼ ð2 , 2 Þ for model (3). We propose to estimate h ¼ ða, xÞ by b maximizing the following pseudo-likelihood for h based on A: Z b ¼ ðA; b Aða, gÞ, b Lðh; AÞ DbÞðg; 0, Dg ðxÞÞ dg ð9Þ A

where ð; l, DÞ is the MVN density with mean l and variance D and the notation Aða, gÞ is to emphasize the dependence on parameters and random effects. This is a pseudo-likelihood because b is an approximation and the variance matrix b the normality of A Dbis an estimate rather than the truth. A To maximize the pseudo-likelihood (9), we use the following MCEM algorithm. We treat the b as observed data, and work with the complete-data random effects g as missing data and A likelihood b gjhÞ ¼ ðA; b Aða, gÞ, b f ðA, DbÞðg; 0, Dg ðxÞÞ A

Downloaded from smm.sagepub.com at CMU Libraries - library.cmich.edu on September 26, 2015

ð10Þ

XML Template (2015) [1.4.2015–12:31pm] [1–16] //blrnas3.glyph.com/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/150021/APPFile/SG-SMMJ150021.3d (SMM) [PREPRINTER stage]

6

Statistical Methods in Medical Research 0(0) The MCEM algorithm consists of the following two steps (to be iterated until convergence).

E-STEP. At the ðr þ 1Þth iteration, we calculate the Q-function Z ðrÞ b gjhÞgf ðgjA, b hðrÞ Þ dg Qðhjh Þ ¼ flog f ðA,

ð11Þ

b hðrÞ Þ is the conditional density of g given A, b evaluated at hðrÞ , the current estimate of h. where f ðgjA, The integral in (11) is evaluated using a Monte Carlo numerical integration algorithm based on Gibbs sampling and the Metropolis–Hastings algorithm. The algorithm involves updating in turn every element of g, denoted by s, using the full conditional distributions b hðrÞ Þ / f ðA, b gjhðrÞ Þ f ð s jgðsÞ , A,

ð12Þ

where gðsÞ ¼ gnf s g. Specifically, to draw a Markov chain fgð1Þ , . . . , gðNÞ g from the conditional b hðrÞ Þ using Gibbs sampler, we start from a set of initial values gð0Þ with distribution f ðgjA, ð0Þ b ðrÞ f ðg jA, h Þ 4 0. At the nth step of the Gibbs sampling procedure, a candidate s is drawn from a proposal distribution f ð s j ðn1Þ , hðrÞ Þ for each s. The proposal distribution we use is a normal s ðrÞ 2 ðn1Þ distribution, Nð s , ð0:5 Þ Þ or Nð ðn1Þ , ð0:5ðrÞ Þ2 Þ, according as s ¼ j or  ij. The sampled s is s accepted as ðnÞ s with probability ( ) b hðrÞ Þ f ð ðn1Þ j s , hðrÞ Þ f ð s je gðsÞ , A, s min 1, b hðrÞ Þ f ð s j ðn1Þ f ð ðn1Þ je g , A, , hðrÞ Þ s s ðsÞ

ðnÞ where e gðsÞ is gðn1Þ nf ðn1Þ g whose first s – 1 elements have been updated as ð ðnÞ s 1 , . . . , s1 Þ. If s is not ðn1Þ accepted, we set ðnÞ . The ðn þ 1Þth step of Gibbs sampling starts once all elements of gðn1Þ s ¼ s have been updated into gðnÞ . This procedure is repeated until a Markov chain fgð1Þ , . . . , gðNÞ g is obtained. Deleting the first N0 draws fgð1Þ , . . . , gðN0 Þ g for burn-in, we then obtain an approximation to QðhjhðrÞ Þ as ðrÞ ^ Qðhjh Þ¼

N X 1 b Aða, gðnÞ Þ, b flog ðA; DbÞ þ log ðgðnÞ ; 0, Dg ðxÞÞg A N  N0 n¼N þ1

ð13Þ

0

ðrÞ ^ M-STEP. The M-step is to find hðrþ1Þ ¼ ðaðrþ1Þ , xðrþ1Þ Þ ¼ arg maxh Qðhjh Þ. Because the parameters a and x are separated on the right-hand side of (13), we have

aðrþ1Þ ¼ arg maxa

N X n¼N0 þ1

xðrþ1Þ ¼ arg maxx

N X

b Aða, gðnÞ Þ, b logðA; DbÞ A

logðgðnÞ ; 0, Dg ðxÞÞ

n¼N0 þ1

For xðrþ1Þ , the maximization can be done analytically to yield ð2 Þðrþ1Þ ¼

J N X X 1 ððnÞ Þ2 JðN  N0 Þ j¼1 n¼N þ1 j 0

Downloaded from smm.sagepub.com at CMU Libraries - library.cmich.edu on September 26, 2015

XML Template (2015) [1.4.2015–12:31pm] [1–16] //blrnas3.glyph.com/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/150021/APPFile/SG-SMMJ150021.3d (SMM) [PREPRINTER stage]

Liu et al.

7

for models (2) and (3), and ð2 Þðrþ1Þ ¼

I X J N X X 1 ð ðnÞ Þ2 IJðN  N0 Þ i¼1 j¼1 n¼N þ1 ij 0

for model (3). Upon convergence, the MCEM algorithm produces the maximum pseudo-likelihood estimate b h. We consider several approaches to estimating the variance of b h. Obviously, the method of Louis19 can be used to calculate the observed information matrix associated with the pseudo-likelihood (9). However, because the pseudo-likelihood does not account for the variability in b Db, the associated A information matrix is likely to over-estimate the available information about h and thus underestimate the variability of b h. Therefore, we also consider bootstrap procedures for variance estimation. One possible approach is to bootstrap subjects but not readers, and the resulting variance estimate accounts for the sampling variability in subjects but not readers. To address both sources of variability, we can bootstrap subjects and readers simultaneously. The latter procedure is intuitively appealing, although it does raise a computational issue: an unstructured b Db will be singular for a bootstrap sample in which some readers are sampled more than once. This A problem can be solved, however, by imposing a structure such as (8) on b Db. A

4 Numerical results 4.1 The CRRS-4 study The numerical results to be presented in this section are based on a study of breast cancer detection which actually motivated this research. The somo-v Automated Breast Ultrasound System (ABUS) is an adjunct to mammography for breast cancer screening in asymptomatic women for whom screening mammography findings are normal or benign, with dense breast parenchyma, and have not had a previous clinical breast intervention. The device is intended to increase breast cancer detection in the described patient population. An MRMC study known as the Clinical Retrospective Reader Study (CRRS-4; ClinicalTrials.gov identifier: NCT01424956) was conducted to evaluate the safety and effectiveness of ABUS for its intended use. The study compared X-ray mammography (XRM) alone (test 1) with ABUS þ XRM (test 2) with respect to the AUC for detecting breast cancer. In this article, we work with XRM and ABUS þ XRM images from K ¼ 164 women (K1 ¼ 31 with breast cancer; K0 ¼ 133 without). Each of these 328 (I  K, I ¼ 2) images was read by each of J ¼ 17 readers who were trained in breast imaging with 10 years of practical experience and who received additional training for using ABUS. The readers first interpreted the XRM images and recorded their image readings on an electronic case report form. After reading all XRM images, the readers then reviewed the ABUS þ XRM images and recorded their readings. The recorded readings included a 0–100% rating of malignancy (for ROC analyses) as well as more discrete assessments (for estimating sensitivity and specificity). The 0–100% rating will be treated as a semi-continuous test result in our analyses. As a preliminary analysis, Figure 1 shows, for each test, a non-parametric ROC curve estimate for each reader as well as a pooled ROC curve estimate for all readers. It appears that the ROC curves for ABUS þ XRM are generally higher than those for XRM, with a fair amount of variation between readers. Figure 2 further describes the between-reader variation and the within-reader association of the AUCs for the two tests. Each circle in Figure 2 represents a different reader, and the coordinates are non-parametric AUC estimates (for XRM and ABUS þ XRM) given by equation (6) with hð y0 , y1 Þ defined by (1). Figure 2 suggests that readers differ in their general

Downloaded from smm.sagepub.com at CMU Libraries - library.cmich.edu on September 26, 2015

XML Template (2015) [1.4.2015–12:31pm] [1–16] //blrnas3.glyph.com/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/150021/APPFile/SG-SMMJ150021.3d (SMM) [PREPRINTER stage]

Statistical Methods in Medical Research 0(0)

0.2

0.2

0.4

0.4

TPR

TPR

0.6

0.6

0.8

0.8

1.0

1.0

8

0.0

0.2

0.4

0.6

0.8

ABUS+XRM

0.0

0.0

XRM

0.0

1.0

0.2

0.4

0.6

0.8

1.0

FPR

FPR

0.7 0.5

0.6

ABUS+XRM

0.8

0.9

Figure 1. Non-parametric estimates of reader-specific and overall ROC curves for XRM (left) and ABUS þ XRM (right) based on the CRRS-4 data. The dashed diagonal line corresponds to a diagnostic test based on random guesses.

0.5

0.6

0.7

0.8

0.9

XRM

Figure 2. Non-parametric estimates of reader-specific AUCs for XRM and ABUS þ XRM based on the CRRS-4 data. Each circle represents a reader, and the coordinates are AUC estimates for XRM and ABUS þ XRM. The dashed diagonal line corresponds to equivalence of the two tests. The mixed line is a simple linear regression for the scatter plot.

Downloaded from smm.sagepub.com at CMU Libraries - library.cmich.edu on September 26, 2015

XML Template (2015) [1.4.2015–12:31pm] [1–16] //blrnas3.glyph.com/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/150021/APPFile/SG-SMMJ150021.3d (SMM) [PREPRINTER stage]

Liu et al.

9

diagnostic accuracy (regardless of test) and possibly in the extent to which they benefit from the use of ABUS in addition to XRM. These observations suggest that model (3), with a random effect for reader and a random interaction between reader and test, may be appropriate for the CRRS-4 study.

4.2

Simulation results

Before presenting our analysis of the CRRS-4 data, we first report a simulation study to assess the performance of the proposed method in comparison to the OR method. This simulation study is focused on model (3) because it seems appropriate for the CRRS-4 study and also because it is more complex and thus more challenging to fit than model (2). Data generation is based on model (5) with being the identity function and with C0 ¼ C1 ,

mC0 ¼ mC1 ,

RC0 ¼ RC1 ,

0 ¼ 1

in addition to the distributional assumptions stated in Section 2. This implies that model (3) holds with g being the probit function. In all experiments but the last one (to be described later), the parameters in model (5) are set at the following values: m10 ¼ 8:61,

m11 ¼ 16:32,

m20 ¼ 11:03,

Cd ¼ 8:50, mCd ¼ 7:54, RCd ¼ 7:69, R0 ¼ 10:62, R1 ¼ 18:39, R ¼ 0:90 mR0 ¼ 1:09,

mR1 ¼ 7:33,

m21 ¼ 42:33 d ¼ 12:34

ðd ¼ 0, 1Þ

mR ¼ 0:72

based on a restricted maximum likelihood analysis of the CRRS-4 data using the lmer function in R. The implied values of the parameters in model (3) are shown in Table 1. We fix I ¼ 2 and experiment with different combinations of ðK0 , K1 , JÞ (to be specified later). For each combination, 500 MRMC datasets are simulated. Each simulated dataset is analyzed using the proposed method as well as the OR method as described by Hillis.14 Both methods are used to estimate Ai (i ¼ 1, 2), the mean AUC (across readers) for each test, as well as the difference A2  A1 , which represents the improvement in mean AUC of ABUS þ XRM over XRM alone. In addition, the proposed method also estimates the parameters in model (3), which can be used to describe the distribution of reader-specific AUCs for each test. Under model (3), Ai is derived as (4) and estimated by substituting estimates of model parameters. b consists of the non-parametric estimates given by (6), where hð y0 , y1 Þ is defined For both methods, A by (1), and b Db is a structured estimate based on (8). (Although the proposed method could use an A unstructured variance estimate based on the theory of U-statistics, preliminary simulation experiments suggest that the unstructured estimate does not perform as well as the structured one.) The MCEM algorithm described in Section 3, with N ¼ 1000 and N0 ¼ 200, is used to maximize the pseudo-likelihood (9). The proposed method makes inference using the analytic approach of Louis19 and a non-parametric bootstrap procedure based on 200 bootstrap samples of subjects and readers (to be drawn with replacement). The analytic approach produces 95% Wald confidence intervals (CIs), and the bootstrap procedure produces both Wald and percentile CIs. (Note that this bootstrap procedure would not be feasible for an unstructured estimate of Db, which A is singular when a reader is sampled more than once.) The OR method does not involve bootstrapping; its inference is based on analytic standard errors and Wald CIs. Table 1 shows the simulation results (empirical bias, empirical standard deviation, average standard error, and empirical coverage probability) for both the proposed method and the

Downloaded from smm.sagepub.com at CMU Libraries - library.cmich.edu on September 26, 2015

XML Template (2015) [1.4.2015–12:31pm] [1–16] //blrnas3.glyph.com/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/150021/APPFile/SG-SMMJ150021.3d (SMM) [PREPRINTER stage]

10

Statistical Methods in Medical Research 0(0)

Table 1. Simulation results for the proposed method (with the probit link) and the OR method: empirical bias, empirical standard deviation (SD), average standard error (SE) and empirical coverage probability (CP) for estimating A1, A2, A2  A1 and (for the proposed method) the parameters in model (3), in a situation that resembles the CRRS-4 study. The proposed method makes inference using the analytic approach of Louis19 and a non-parametric bootstrap procedure. The analytic approach produces 95% Wald confidence intervals (CIs), and the bootstrap procedure produces both Wald and percentile CIs. Proposed method

OR method

Louis19

Bootstrap

Parameter

True Value

Bias

SD

SE

1 2 2  1   A1 A2 A2  A1

0.30 1.20 0.90 0.38 0.25 0.61 0.86 0.26

0.00 0.05 0.05 0.01 0.02 0.00 0.01 0.01

0.15 0.19 0.16 0.10 0.07 0.05 0.04 0.04

0.15 0.20 0.16 0.10 0.07 0.05 0.04 0.04

1 2 2  1   A1 A2 A2  A1

0.30 1.20 0.90 0.38 0.25 0.61 0.86 0.26

0.01 0.05 0.04 0.01 0.03 0.00 0.01 0.00

0.15 0.18 0.16 0.10 0.06 0.05 0.03 0.04

0.16 0.21 0.16 0.11 0.07 0.05 0.04 0.04

K0 ¼ K1 0.91 0.91 0.92 0.93 0.88 0.90 0.87 0.92

¼ 50, 0.16 0.20 0.17 0.10 0.07 0.05 0.04 0.04

1 2 2  1   A1 A2 A2  A1

0.30 1.20 0.90 0.38 0.25 0.61 0.86 0.26

0.02 0.05 0.03 0.03 0.04 0.01 0.01 0.00

0.17 0.22 0.19 0.13 0.08 0.06 0.04 0.05

0.18 0.22 0.19 0.14 0.09 0.06 0.04 0.05

K0 ¼ K1 0.91 0.92 0.91 0.90 0.81 0.91 0.89 0.90

1 2 2  1   A1 A2 A2  A1

0.30 1.20 0.90 0.38 0.25 0.61 0.86 0.26

0.02 0.05 0.03 0.05 0.08 0.00 0.01 0.00

0.23 0.28 0.23 0.18 0.11 0.08 0.05 0.06

0.22 0.26 0.20 0.16 0.09 0.07 0.05 0.06

CPpercentile

Bias

SD

SE

CPWald

0.94 0.92 0.93 0.94 0.96 0.94 0.92 0.94

0.00 0.00 0.00

0.05 0.03 0.04

0.04 0.04 0.04

0.90 0.99 0.95

J ¼ 17 0.96 0.97 0.98 0.95 0.93 0.95 0.92 0.94

0.94 0.94 0.94 0.94 0.96 0.94 0.94 0.95

0.00 0.00 0.00

0.05 0.03 0.04

0.04 0.04 0.04

0.88 0.99 0.94

¼ 50, 0.18 0.24 0.21 0.13 0.09 0.06 0.04 0.05

J ¼ 10 0.95 0.98 0.96 0.91 0.87 0.94 0.92 0.96

0.94 0.93 0.94 0.90 0.92 0.94 0.94 0.95

0.00 0.00 0.00

0.06 0.04 0.05

0.05 0.05 0.05

0.90 0.99 0.94

K0 ¼ K1 ¼ 50, 0.84 0.23 0.90 0.33 0.88 0.30 0.83 0.15 0.68 0.09 0.84 0.08 0.87 0.05 0.85 0.07

J¼5 0.94 0.98 0.98 0.80 0.66 0.92 0.90 0.94

0.92 0.93 0.94 0.77 0.73 0.92 0.91 0.93

0.00 0.00 0.00

0.08 0.05 0.06

0.06 0.06 0.06

0.89 0.98 0.94

CPWald

SE

CPWald

K0 ¼ 131, K1 ¼ 33, J ¼ 17 0.91 0.15 0.96 0.91 0.20 0.96 0.91 0.17 0.96 0.94 0.10 0.96 0.86 0.07 0.94 0.91 0.05 0.96 0.88 0.03 0.92 0.92 0.04 0.95

(continued)

Downloaded from smm.sagepub.com at CMU Libraries - library.cmich.edu on September 26, 2015

XML Template (2015) [1.4.2015–12:31pm] [1–16] //blrnas3.glyph.com/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/150021/APPFile/SG-SMMJ150021.3d (SMM) [PREPRINTER stage]

Liu et al.

11

Table 1. Continued. Proposed method

OR method 19

Bootstrap

Louis Parameter

True Value

Bias

SD

SE

1 2 2  1   A1 A2 A2  A1

1.19 1.93 0.74 0.38 0.25 0.86 0.96 0.10

0.04 0.13 0.09 0.02 0.06 0.01 0.01 0.00

0.16 0.25 0.23 0.11 0.09 0.03 0.02 0.03

Higher AUCs, 0.17 0.88 0.25 0.89 0.20 0.89 0.13 0.94 0.08 0.70 0.03 0.83 0.01 0.77 0.03 0.87

SE

CPWald

CPWald

CPpercentile

K0 ¼ K1 ¼ 50, J ¼ 17 0.17 0.96 0.94 0.36 0.98 0.91 0.33 0.99 0.94 0.12 0.96 0.97 0.09 0.83 0.90 0.03 0.94 0.95 0.02 0.83 0.89 0.03 0.96 0.96

Bias

SD

SE

CPWald

0.00 0.00 0.00

0.03 0.01 0.03

0.02 0.02 0.02

0.86 1.00 0.94

OR method. The OR method is generally unbiased for estimating A1, A2 and A2  A1 , and has good coverage for A2  A1 , although it appears to have an undercoverage problem for A1 and an overcoverage problem for A2. In what follows, we focus on the performance of the proposed method in different scenarios. The first section of Table 1 corresponds to the setting of the CRRS-4 study, with K0 ¼ 133, K1 ¼ 31 and J ¼ 17. In this case, it appears that 2, 2  1 and  are more susceptible to bias than the other parameters. When compared with the true parameter values, the observed biases are generally small (

Generalized linear mixed models for multi-reader multi-case studies of diagnostic tests.

Diagnostic tests are often compared in multi-reader multi-case (MRMC) studies in which a number of cases (subjects with or without the disease in ques...
318KB Sizes 4 Downloads 22 Views