Sample Size Requirements for Establishing Clinical Test–Retest Standards

Garnett P. McMillan (1,2) and Timothy E. Hanson (3)

Objective: To define sample size requirements for establishing clinical serial monitoring protocols.

Design: The 95% confidence bound of a critical difference score is defined and used to identify false-negative regions suitable for sample size calculation.

Results: Reference subject sample sizes vary from about 40 to 480 subjects, depending on the minimum acceptable error rates of the clinical protocol.

Conclusions: Sample size requirements for establishing test–retest standards are generally defined and suitable for any serial monitoring protocol.

Key words: Critical difference scores, Sample size, Serial monitoring, Test–retest.

(Ear & Hearing 2014;35:283–286)

In this article we identify sample sizes necessary to accurately define a clinical test–retest standard. A test–retest standard is a quantitative threshold beyond which a change in a patient’s measurements is considered clinically unusual. For example, the National Cancer Institute Common Toxicity Criteria for Adverse Events version 4.03 (NCI 2010) identifies a pediatric grade 1 ototoxic event as a hearing shift exceeding a 20 dB test–retest standard at 8 kHz. A pediatric cisplatin patient presenting a hearing shift that exceeds this standard may be recommended for further evaluation and possible treatment modification. Test–retest standards are ubiquitous in audiology, though they go by many names. Phrases such as “normal test–retest variability” or “retest error” are commonly used in audiology, as is “minimum clinical difference” or simply “reference limits” in other fields. The “critical difference score” is the minimum change in some metric, such as a questionnaire scale, psychophysical test, or pure-tone threshold, for a clinically significant auditory disruption to have occurred. We will use the phrase critical difference score in this article, though the discussion pertains to audiological shift standards for any quantitative metric. Despite the variety of names, these quantities can all be computed in the same way, and all rely on the same underlying assumptions. Accordingly, sample size requirements for establishing critical difference scores can be characterized in a unified and, as it turns out, surprisingly simple way. This article will provide sample size recommendations for establishing critical difference scores in the following sequence: (1) define the standard theoretical model of serial measurements; (2) define sample statistics for determining the critical difference score; (3) define the variance and resulting lower confidence bound of the critical difference score; (4) define the “false-negative rate” concept; and (5) define minimum sample sizes needed to establish the critical difference score. An example using distortion product otoacoustic emission (DPOAE) input/output functions for serial monitoring of adult and pediatric cisplatin patients is provided.

THE STANDARD THEORETICAL MODEL OF SERIAL MEASUREMENT

A critical difference score is an estimate of a particular percentile of the distribution of measurements in a standard population. A 5% critical difference score is the estimated value of the measurement below which 5% of the reference population lies. A 95% critical difference score is the percentile below which 95% of the reference distribution lies. Certain clinical problems require both an upper and a lower critical difference score, thus defining a “reference interval.” A 95% reference interval is defined by the 2.5% and 97.5% critical difference scores. The 95% reference interval thus “contains” the central 95% of the reference distribution, which is bounded by the 2.5th and 97.5th critical difference scores. Reference intervals and critical difference scores must not be confused with the confidence intervals of these statistics. Confidence intervals are the set of critical difference scores in the reference population that are consistent with the reference sample at the 95% confidence level. A confidence interval for the critical difference score pertains to population inferences about the critical difference score, while the critical difference score identifies a specific percentile of the reference population measurements. Tradition invokes 95% confidence intervals and also 95% reference intervals, but these two intervals have completely different interpretations despite having the same percentage. The critical difference score is established by taking repeated measurements in a reference sample of subjects who are otherwise unaffected by any insults or enhancements that might change the distribution of their measurements. This is to say that the reference population is homeostatic with respect to the auditory measurements under consideration. Let Y1 and Y2 denote baseline and follow-up measurements. It is generally

This article describes a simple method of determining the sample sizes needed to establish clinical test–retest standards. It expands on work published by Linnet (1987) by characterizing sample size requirements in terms of the false-negative rate, an approach that is suitable for any Gaussian-data reference standard and easier to compute than the methods proposed by Linnet. A variety of challenges remain. Non-Gaussian data, such as measurements with skewed distributions or step-like data such as pure-tone audiometry, will require Gaussian transformation or nonparametric alternatives. In addition, audiologists regularly base their clinical judgment on the combined results of several tests. However, there is no accepted statistical method for combining screening tests, and thus no sample size formulae for efficiently generating multivariate reference standards. This situation is ubiquitous in laboratory medicine, where sample size requirements based on univariate methods, such as those described in this article, are the standard for study design.

1 VA RR&D National Center for Rehabilitative Auditory Research (NCRAR), Portland VA Medical Center, Portland, Oregon, USA; 2 Oregon Health and Science University, Department of Public Health and Preventive Medicine, Portland, Oregon, USA; and 3 Department of Statistics, University of South Carolina, 216 LeConte College, Columbia, South Carolina, USA.

0196/0202/14/352-0283/0 • Ear & Hearing • Copyright © 2013 by Lippincott Williams & Wilkins • Printed in the U.S.A. 283


assumed that these measurements are bivariate normal random variables, which, as a result of the homeostasis assumption, have constant mean and variance at each time point. Let these parameters equal μ and σ², respectively. Correlation between the baseline and follow-up measurements is given by ρ. For paired measurements (Yi1, Yi2) on n individuals, the theoretical model is succinctly written as



$$\begin{pmatrix} Y_{i1} \\ Y_{i2} \end{pmatrix} \sim N_2\!\left(\begin{pmatrix} \mu \\ \mu \end{pmatrix},\ \begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix}\right), \qquad i = 1, \ldots, n. \tag{1}$$

Let (Y2 − Y1) denote the shift from baseline to follow-up. McMillan et al. (2013) show that the critical difference score, denoted c, depends on the variance of the shift, given by 2σ²(1 − ρ).

SAMPLE STATISTICS FOR DETERMINING THE CRITICAL DIFFERENCE SCORE

In a sample taken from a reference population, the variance of the shift is estimated by 2S², where S² is one of several estimators: (1) the squared “standard error of measurement” (SEM), computed as the sample variance of all the observed measurements (baseline and follow-ups combined) multiplied by one minus the Pearson correlation coefficient between the baseline and follow-up measurements (Demorest & Walden 1984); (2) the mean squared error from the repeated-measures analysis of variance model fit to the observed data, with subject i as a random factor; or (3) one half the sample variance of the observed shifts, so that 2S² equals the variance of the observed shifts. These estimators are outlined in greater detail, with examples, in McMillan et al. (2013). As a result,

• The 2.5% and 97.5% critical difference scores, given by ±1.96√(2S²), define a 95% reference interval.
• The 5% and 95% critical difference scores, given by ±1.645√(2S²), define a 90% reference interval.
• The 10% and 90% critical difference scores, given by ±1.282√(2S²), define an 80% reference interval.

Any percentile critical difference score can be estimated as √(2S²) times the associated standard normal quantile, derived from widely published tables.

VARIANCE AND RESULTING CONFIDENCE BOUND OF THE CRITICAL DIFFERENCE SCORE

The true critical difference score is estimated from the reference sample measurements using one of the methods listed above. The estimated critical difference score is therefore subject to sampling imprecision, which is reduced by collecting more data. For estimator (2), Kristof (1963; eqs. 4a and 18) shows that $nS^2/\left[\sigma^2(1-\rho)\right]$ is a Chi-square random variable with n degrees of freedom, where n is the sample size. This result is exact for estimator (3) with (n − 1) substituted for n, and also holds for the SEM estimator given in (1) with reasonably large sample sizes. On the basis of Kristof’s result (see Appendix), a lower 95% confidence bound for the 97.5% critical difference score is given by

$$l = 1.96\sqrt{\frac{2S^2\, n}{\chi^2_n(0.95)}},$$

where $\chi^2_n(0.95)$ is taken from tables of the Chi-square distribution. By definition, l is the smallest “true” critical difference score that is consistent with the data at the 95% confidence level. The lower bound l is an increasing function of sample size, so that larger samples yield a lower confidence bound that is closer to the estimated critical difference score.

THE FALSE-NEGATIVE RATE

A study conducted to determine the critical difference score in a reference population requires statistical consideration of the minimum sample size needed to estimate the critical difference score precisely. In principle, one could deduce sample size requirements using Linnet’s (1987) methodology for estimator (3) based on the observed shifts, but this would entail several nonintuitive steps requiring a variety of parameter inputs that are not always available to investigators. Instead, we generate sample size requirements by contrasting the lower confidence bound l with the critical difference score c. The lower bound is by definition smaller than the critical difference score, so one always runs the risk of defining a critical difference score that is too large if, in fact, l is the true value. Thus, patients with auditory shifts between l and c will be overlooked within the serial monitoring protocol if we recommend c rather than l as the clinical standard. In this sense these overlooked subjects are false negatives, analogous to the concept used in diagnostic test development. This is illustrated in Figure 1. The lower bound l is smaller than the estimated critical difference score, but because the lower confidence limit is an increasing function of sample size, l can be made arbitrarily close to the estimated critical difference score by increasing the sample size. False negatives fall in the region enclosed in brackets. The investigator reduces the risk of false negatives by increasing the magnitude of the lower bound with larger samples. Note that the usual false-negative rate in diagnostic testing is a property of the test method, not of the reference sample or sample size; we use the analogy only to enhance the interpretability of the sample size calculations provided here. One minus l/c is the false-negative rate for clinical screening protocols. The false-negative rate, so defined, is

$$\text{false-negative rate} = 1 - \frac{l}{c} = 1 - \frac{1.96\sqrt{2S^2\, n/\chi^2_n(0.95)}}{1.96\sqrt{2S^2}} = 1 - \sqrt{\frac{n}{\chi^2_n(0.95)}}. \tag{2}$$

Fig. 1. Illustration of the “false-negative rate” concept. c denotes the estimated critical difference score. l denotes the lower confidence limit of the critical difference score estimated with smaller and larger sample sizes. The bracket encloses the region in which false-negative measurements would occur if l is in fact correct.
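The estimators above can be illustrated numerically. The following Python/NumPy sketch (not part of the original article; the parameter values are invented for illustration) simulates paired measurements under the bivariate normal model of Equation (1), computes S² via the SEM estimator (1) and the half-variance-of-shifts estimator (3), and forms the 97.5% critical difference score:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical reference-population parameters (illustration only):
# common mean mu, within-time variance sigma^2, test-retest correlation rho.
n, mu, sigma, rho = 200, 20.0, 5.0, 0.8

# Simulate paired (baseline, follow-up) measurements under Equation (1).
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
y = rng.multivariate_normal([mu, mu], cov, size=n)
y1, y2 = y[:, 0], y[:, 1]

# Estimator (1): squared SEM = variance of all observations x (1 - Pearson r).
r = np.corrcoef(y1, y2)[0, 1]
s2_sem = np.var(np.concatenate([y1, y2]), ddof=1) * (1.0 - r)

# Estimator (3): one half the sample variance of the observed shifts,
# so that 2*S^2 estimates var(Y2 - Y1) = 2*sigma^2*(1 - rho).
s2_shift = 0.5 * np.var(y2 - y1, ddof=1)

# 97.5% critical difference score; +/- c defines the 95% reference interval.
c = 1.96 * np.sqrt(2.0 * s2_shift)
print(s2_sem, s2_shift, c)
```

Under these made-up parameters both S² estimates target σ²(1 − ρ) = 5, and c approximates the true 97.5% critical difference score 1.96√(2·5) ≈ 6.2.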




Note that the S² terms cancel, so the false-negative rate does not depend on the statistics c or S². Thus, for a given false-negative rate, sample size requirements are the same for any auditory metric that follows the bivariate normal model. Furthermore, sample size requirements are the same no matter what percentile one chooses for the critical difference score: for a 90% critical difference score, 1.282 is substituted for 1.96 and likewise cancels, and so forth for other definitions of the critical difference score. One does not need troublesome quantities such as expected effect sizes or standard deviations to determine the minimum sample sizes for establishing clinical test–retest standards.

SAMPLE SIZE REQUIREMENTS

Figure 2 plots the false-negative rate as a function of sample size. On the basis of Figure 2, an investigator requiring a false-negative rate of 10% will need about 100 subjects in the reference sample. About 40 subjects are needed for a 15% false-negative rate. A false-negative rate below 5% will require very large samples, on the order of 480 or more subjects. We reiterate that these recommendations are the same regardless of the auditory metric or the percentile of the critical difference score.
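As a sketch of how these sample sizes can be reproduced (this code is not part of the original article and assumes SciPy is available), the following Python snippet evaluates Equation (2) and searches for the smallest reference sample size meeting a target false-negative rate:

```python
from scipy.stats import chi2

def false_negative_rate(n: int) -> float:
    # Equation (2): 1 - sqrt(n / chi^2_n(0.95)); depends only on n.
    return 1.0 - (n / chi2.ppf(0.95, df=n)) ** 0.5

def required_n(target: float) -> int:
    # Smallest reference sample size whose false-negative rate is <= target.
    n = 2
    while false_negative_rate(n) > target:
        n += 1
    return n

for target in (0.15, 0.10, 0.05):
    print(f"{target:.0%}: n = {required_n(target)}")
```

The search returns sample sizes close to the 40, 100, and 480 quoted above.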

EXAMPLE OF CISPLATIN OTOTOXICITY MONITORING WITH OTOACOUSTIC EMISSIONS

DPOAE input/output functions can be used to measure the smallest primary-frequency stimulus level needed to elicit a valid DPOAE in a patient. This quantity is denoted the “DPIO threshold.” Changes in DPIO thresholds can be used for serial monitoring of patients receiving the ototoxic drug cisplatin: patients showing a DPIO threshold increase greater than a critical difference score are likely to have a reduction in cochlear integrity. Investigators will collect data in nonclinical reference samples of adults and children to determine the clinically recommended 95% critical difference score. Sample size requirements for adults and children are likely to differ because the consequences of false negatives differ. We might prefer a critical difference score with a


small false-negative rate for pediatric oncology patients, because hearing loss at a young age can significantly impair language acquisition, with life-long consequences for the patient. Investigators may therefore require a false-negative rate of no more than 5% for their critical difference score recommendation. Figure 2 shows that this requirement corresponds to a sample size of about 480 nonclinical reference subjects. A critical difference score for DPIO thresholds in adults may not require such a small false-negative rate, because life expectancy among adults is comparatively short and language acquisition is already established. The risk of hearing loss may be offset by the potential life-saving effect of cisplatin, so that a false-negative rate of 10% is deemed acceptable. A sample size of about 100 adult nonclinical subjects would be suitable for establishing test–retest standards for adult cisplatin patients.

DISCUSSION

Data that do not follow the bivariate normal model can be approached in three different ways. The first is to induce normality via a suitable transformation. This approach is widely discussed in the reference interval literature (Wright & Royston 1999). If the data can be transformed to normality, then the sample size requirements described in Figure 2 apply. A second approach uses “robust” estimators (e.g., Horn et al. 1998), though these are not widely accepted in the test development literature. A third approach is to derive the critical difference score from the empirical cumulative distribution function of the data. While this nonparametric approach seems appealing, it greatly complicates sample size estimation. We do not attempt this here, though we note that sample size requirements using nonparametric estimators are about double those needed under the normal data model (Linnet 1987). A final point pertains to reference regions for multivariate responses such as audiograms or DP-grams. There is no universally accepted method for establishing multivariate reference regions. One approach, which is conservative but an acceptable starting point, applies a Bonferroni-type adjustment to the confidence limits, so that the denominator of l is increased to $\chi^2_n(0.975)$ or $\chi^2_n(0.99)$. This increases the false-negative rate at a given sample size, so that one would need more subjects in the reference sample to maintain a nominal false-negative rate. Methods of establishing multivariate reference regions, and sample size requirements to do so accurately, are under development at the National Center for Rehabilitative Auditory Research.
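The effect of the Bonferroni-type adjustment can be sketched numerically (Python with SciPy assumed; illustrative only): raising the Chi-square quantile in the denominator of l increases the false-negative rate at a fixed sample size.

```python
from scipy.stats import chi2

def fnr(n: int, q: float = 0.95) -> float:
    # False-negative rate with the q-quantile in the denominator of l;
    # q = 0.975 or 0.99 corresponds to the Bonferroni-type adjustment.
    return 1.0 - (n / chi2.ppf(q, df=n)) ** 0.5

n = 100
rates = {q: fnr(n, q) for q in (0.95, 0.975, 0.99)}
print(rates)  # the rate grows as the quantile grows
```

At n = 100, for example, the adjusted quantiles yield noticeably larger false-negative rates, so a larger reference sample is required to restore the nominal rate.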

ACKNOWLEDGMENTS

This work was supported by the Department of Veterans Affairs RR&D Service (grants C4183R and C7113N) and the VA RR&D National Center for Rehabilitative Auditory Research, Portland, OR. Kelly Reavis, Kristy Knight, and Marjorie Leek provided valuable advice on the development of this article. The authors declare no conflict of interest.

Fig. 2. Sample size requirements for establishing critical difference scores under different “false-negative rates,” assuming a 95% lower confidence limit.

Address for correspondence: Garnett P. McMillan, Portland VA Medical Center—NCRAR, 3710 US Veterans Hospital Road, P5, Portland, OR 97239, USA. E-mail: [email protected]

Received February 8, 2013; accepted August 14, 2013.


REFERENCES

Demorest, M. E., & Walden, B. E. (1984). Psychometric principles in the selection, interpretation, and evaluation of communication self-assessment inventories. J Speech Hear Disord, 49, 226–240.
Horn, P., Pesce, A. J., & Copeland, B. E. (1998). A robust approach to reference interval estimation and evaluation. Clin Chem, 44, 622–631.
Kristof, W. (1963). Statistical inferences about the error variance. Psychometrika, 28, 129–143.
Linnet, K. (1987). Two-stage transformation systems for normalization of reference distributions evaluated. Clin Chem, 33, 381–386.
McMillan, G. P., Reavis, K. M., Konrad-Martin, D., & Dille, M. F. (2013). The statistical basis for serial monitoring in audiology. Ear Hear, 34, 610–618.
National Cancer Institute. (2010). Common Terminology Criteria for Adverse Events, Version 4.03. Retrieved from http://evs.nci.nih.gov/ftp1/CTCAE/CTCAE_4.03_2010-06-14_QuickReference_8.5x11.pdf
Wright, E. M., & Royston, P. (1999). Calculating reference intervals for laboratory measurements. Stat Methods Med Res, 8, 93–112.

APPENDIX

DERIVATION OF THE THEORETICAL VARIANCE OF THE SHIFT AND CONFIDENCE LIMITS FOR THE CRITICAL DIFFERENCE SCORE

According to the bivariate normal model, the variance of the shift (Y2 − Y1), on which the critical difference score depends, is given by

$$\operatorname{var}(Y_2 - Y_1) = \operatorname{var}(Y_2) + \operatorname{var}(Y_1) - 2\operatorname{cov}(Y_2, Y_1) = 2\sigma^2 - 2\rho\sigma^2 = 2\sigma^2(1 - \rho),$$

as was shown in the article. Define the 97.5% critical difference score as $c = 1.96\sqrt{2S^2}$. Kristof (1963; eq. 38a) shows that $nS^2/\left[\sigma^2(1-\rho)\right] \sim \chi^2_n$, where the terms are defined in our article. This tells us that

$$\Pr\!\left(\chi^2_n(0.025) < \frac{nS^2}{\sigma^2(1-\rho)} < \chi^2_n(0.975)\right) = 0.95.$$

By rearranging the terms in parentheses and applying the monotonic transformation $g(z) = 1.96\sqrt{2}\sqrt{z}$ throughout, we get

$$\Pr\!\left(1.96\sqrt{2S^2\, n/\chi^2_n(0.975)} < 1.96\sqrt{2\sigma^2(1-\rho)} < 1.96\sqrt{2S^2\, n/\chi^2_n(0.025)}\right) = 0.95.$$

95% confidence limits for the critical difference score are therefore given by $\left(1.96\sqrt{2S^2\, n/\chi^2_n(0.975)},\ 1.96\sqrt{2S^2\, n/\chi^2_n(0.025)}\right)$, and a 95% one-sided lower confidence limit is given by $1.96\sqrt{2S^2\, n/\chi^2_n(0.95)}$, which is defined as l in the article.
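The appendix result can be checked by simulation. This Python sketch (SciPy/NumPy assumed; not from the original article, with hypothetical parameter values) draws shifts directly from their N(0, 2σ²(1 − ρ)) distribution so that nS²/[σ²(1 − ρ)] is exactly χ²ₙ, and confirms that l falls below the true critical difference score in about 95% of replicated reference studies:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
n, sigma2, rho = 50, 25.0, 0.8                        # hypothetical values
true_c = 1.96 * np.sqrt(2.0 * sigma2 * (1.0 - rho))   # true 97.5% score

reps, covered = 4000, 0
for _ in range(reps):
    # n + 1 observed shifts give a shift variance with exactly n df,
    # so n*S^2 / [sigma^2*(1 - rho)] ~ chi^2_n, as in Kristof's result.
    d = rng.normal(0.0, np.sqrt(2.0 * sigma2 * (1.0 - rho)), size=n + 1)
    s2 = 0.5 * np.var(d, ddof=1)                      # estimator (3)
    l = 1.96 * np.sqrt(2.0 * s2 * n / chi2.ppf(0.95, df=n))
    covered += (l <= true_c)
print(covered / reps)  # close to the nominal 0.95
```

Because the pivot is exact here, the Monte Carlo coverage matches the nominal 95% level up to simulation error.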

ERRATUM

The Statistical Basis for Serial Monitoring in Audiology: Erratum

In the article that appeared on page 610 of the September/October 2013 issue of Ear and Hearing, there is an error. Equation 6 on page 614 should read as follows:

$$V_M = \frac{2\sum\left[\left(y_1 - \frac{y_1 + y_2}{2}\right)^2 + \left(y_2 - \frac{y_1 + y_2}{2}\right)^2\right]}{N}$$

The authors sincerely regret this error.

Reference: McMillan, G. P., Reavis, K. M., Konrad-Martin, D., & Dille, M. F. (2013). The statistical basis for serial monitoring in audiology. Ear Hear, 34, 610–618.
