Scandinavian Journal of Clinical & Laboratory Investigation, 2014; 74: 713–715

ORIGINAL ARTICLE

Determining sample size when assessing mean equivalence

ARNE ÅSBERG, KRISTINE B. SOLEM & GUSTAV MIKKELSEN


Department of Clinical Chemistry, Trondheim University Hospital, Trondheim, Norway

Abstract

Background. When we want to assess whether two analytical methods are equivalent, we could test whether the difference between the mean results is within the specification limits of 0 ± an acceptance criterion. Testing the null hypothesis of zero difference is less interesting, and so is sample size estimation based on testing that hypothesis. Power function curves for equivalence testing experiments are not widely available. In this paper we present power function curves to help decide on the number of measurements when testing equivalence between the means of two analytical methods. Methods. Computer simulation was used to calculate the probability that the 90% confidence interval for the difference between the means of two analytical methods would exceed the specification limits of 0 ± 1, 0 ± 2 or 0 ± 3 analytical standard deviations (SDa), respectively. Results. The probability of getting a nonequivalence alarm increases with increasing difference between the means, even when the difference is well within the specification limits. The probability also increases with decreasing sample size and with smaller acceptance criteria. We may need at least 40–50 measurements with each analytical method when the specification limits are 0 ± 1 SDa, and 10–15 and 5–10 when the specification limits are 0 ± 2 and 0 ± 3 SDa, respectively. Conclusions. The power function curves provide information on the probability of false alarm, so that we can decide on the sample size under less uncertainty.

Key Words: Chemistry techniques, analytical/methods, quality control, reproducibility of results, sample size, systematic bias

Introduction

In clinical chemistry we often compare the means of measurements. For instance, when a vendor starts delivering some reagent from a new batch, we might do several measurements on the same sample material with the new and the old reagents and compare the means. Another example is a laboratory equipped with two instruments for measuring the concentration of the same analyte, where several measurements on the same sample material might be used to examine the systematic difference between the instruments. In both cases, we want the difference between the means to be within the specification limits of 0 ± an acceptance criterion. This is called equivalence testing [1]. It differs from traditional hypothesis testing, where the null hypothesis is that the difference between the means is zero. We accept the two reagents or the two instruments as equivalent even if the difference between the means is not zero, as long as the difference is within 0 ± an acceptance criterion. How do we assess the acceptance criterion?

That is not a trivial question [2,3]. However, given that the acceptance criterion has been determined, we conclude on equivalence if a confidence interval for the difference between the means, for instance the 90% confidence interval, lies between the upper and the lower specification limit [1,2]. If the confidence interval does not include zero, we still conclude on equivalence as long as the confidence interval lies between the specification limits. In traditional hypothesis testing, however, we reject the null hypothesis of equal means if the confidence interval does not include zero. We also think differently when planning an equivalence testing experiment compared to a traditional hypothesis testing experiment. If the real difference between the means is within the specification limits but far from zero, we need a large sample size to prove equivalence, because in that case we need a very narrow confidence interval for the difference between the means so that the confidence interval does not exceed the nearest specification limit. In traditional hypothesis testing a much smaller sample size is needed in this situation, because then we are interested in the confidence interval in relation to zero. So equivalence testing is different from traditional hypothesis testing, and in planning an equivalence testing experiment we cannot use the traditional way of estimating the sample size. Power function curves for equivalence testing experiments are not widely available.

In this paper we present power function curves where the probability of getting a nonequivalence alarm is plotted against the difference between the means. These curves bear a resemblance to the commonly used power function curves of analytical quality control. They can be used when deciding on the sample size in testing equivalence between the means of two analytical methods; the sample size is then the number of measurements of the same sample material with each method.
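As a concrete illustration of this decision rule, the following is a minimal sketch in Python (our own illustration, not the authors' code; the function name equivalent_means and the use of NumPy/SciPy are assumptions). It accepts two methods as equivalent when the 90% confidence interval for the difference between their means, computed with the pooled-variance two-sample t-interval, lies entirely within 0 ± the acceptance criterion.

```python
# Sketch of the confidence-interval decision rule for equivalence described
# above (illustrative; not the authors' code).
import numpy as np
from scipy import stats

def equivalent_means(x, y, acceptance, conf=0.90):
    """True if the conf-level CI for mean(x) - mean(y) lies within +/- acceptance."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = x.size, y.size
    diff = x.mean() - y.mean()
    # Pooled standard deviation; equal analytical SD is assumed, as in the paper.
    sp2 = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1.0 / nx + 1.0 / ny))
    t = stats.t.ppf(1.0 - (1.0 - conf) / 2.0, df=nx + ny - 2)
    lower, upper = diff - t * se, diff + t * se
    # Equivalence is concluded even if the CI excludes zero, as long as the CI
    # stays within the specification limits.
    return bool(-acceptance < lower and upper < acceptance)
```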

Correspondence: Arne Åsberg, Department of Clinical Chemistry, Trondheim University Hospital, N-7006 Trondheim, Norway. E-mail: arne.aasberg@stolav.no

(Received 5 June 2014; accepted 8 August 2014)

ISSN 0036-5513 print/ISSN 1502-7686 online © 2014 Informa Healthcare
DOI: 10.3109/00365513.2014.953993


Methods

We used computer simulation to calculate the probability that the 90% confidence interval for the difference between two means, based on a certain number of measurements, would exceed the specification limits of 0 ± 1, 0 ± 2 and 0 ± 3 analytical standard deviations (SDa) when the true value of the difference is between 0 and 1, 0 and 2, and 0 and 3 SDa, respectively. In each case the computer program drew a specified number of random numbers from two Gaussian (normal) distributions with equal standard deviations. The program then calculated the two means, the difference between the means and the 90% confidence interval for the difference using the appropriate t-value [4]. If the 90% confidence interval exceeded the specification limits, a nonequivalence alarm was registered. This procedure was repeated 1 million times for each point on each power function curve, and for each point the probability of alarm was calculated as the number of alarms divided by 1 million. We determined 21 equidistant points on each power function curve. For the simulations we used the software Statistics101 version 2.8 (http://www.statistics101.net), and to construct the curves we used linear interpolation with Stata version 13 (http://www.stata.com).
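The sketch below re-implements this simulation in Python with NumPy/SciPy rather than Statistics101 and Stata (our own re-implementation under the stated assumptions: equal analytical SD for both methods, differences expressed in SDa units, and the number of repetitions reduced from the paper's 1 million for speed). It estimates one point on a power function curve.

```python
# Monte Carlo estimate of the probability of a nonequivalence alarm, following
# the procedure described above (our own re-implementation; the authors used
# Statistics101 with 1 million repetitions per point).
import numpy as np
from scipy import stats

def p_alarm(n, true_diff, acceptance, reps=100_000, conf=0.90, seed=0):
    """Probability that the conf-level CI for the difference between the means
    of n measurements per method exceeds the limits 0 +/- acceptance.
    All quantities are in units of the analytical SD (SDa), so SD = 1."""
    rng = np.random.default_rng(seed)
    t = stats.t.ppf(1.0 - (1.0 - conf) / 2.0, df=2 * n - 2)
    x = rng.normal(true_diff, 1.0, size=(reps, n))   # method 1, mean shifted
    y = rng.normal(0.0, 1.0, size=(reps, n))         # method 2
    diff = x.mean(axis=1) - y.mean(axis=1)
    sp2 = (x.var(axis=1, ddof=1) + y.var(axis=1, ddof=1)) / 2.0  # pooled (equal n)
    se = np.sqrt(sp2 * 2.0 / n)
    alarm = (diff - t * se < -acceptance) | (diff + t * se > acceptance)
    return alarm.mean()

# Example: one point on the curve for an acceptance criterion of 1 SDa,
# a true difference of 0.5 SDa and n = 10 measurements per method.
print(p_alarm(n=10, true_diff=0.5, acceptance=1.0))
```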

[Figure 1A–C near here.]

Figure 1. The probability of getting a nonequivalence alarm plotted against the difference between the means of two analytical methods measuring the same sample material. The difference is given in units of analytical standard deviation (SDa), which is assumed to be equal for the two methods. The probability of alarm is shown for the acceptance criterion of 1 SDa (A), 2 SDa (B) and 3 SDa (C). The specification limits are 0 ± the acceptance criterion; however, only the probability curves for positive differences are shown, as the curves are symmetric around zero. For each acceptance criterion, curves are constructed for a sample size (number of measurements with each analytical method) of 5, 10, 15, 20, 40 and 60.

Results


Figure 1A–C shows the probability of getting a nonequivalence alarm as a function of the difference between the means, for acceptance criteria of 1, 2 and 3 SDa, respectively. As the curves are symmetric around zero, only the positive differences are shown. For a given acceptance criterion the probability of alarm increases with increasing difference between the means and with decreasing sample size.

Furthermore, the probability of alarm is higher with smaller acceptance criteria. The power function curves converge at a probability of 0.95 when the difference between the means is equal to the upper specification limit. An exception is the power function curve of n = 5 for the acceptance criterion of 1 SDa, where the probability of alarm is about 0.98 at this point (Figure 1A).

Discussion

We used computer simulation to construct the power function curves, because an exact mathematical computation based on noncentral t-distributions is rather complex [5]. In contrast to others [5], we constructed the curves to show the probability of getting an alarm of nonequivalence when the difference between the means is within the specification limits. The reason for this was twofold. First, we think the decision on the sample size (the number of measurements with each analytical method) is a trade-off between the desirability and the cost of a low probability of false alarm. Second, these power function curves have the same form as the power function curves of analytical quality control, which are well known to workers in clinical chemistry.

The probability of true alarm, i.e. the probability of an alarm of nonequivalence when the difference between the means really is at the upper specification limit, is always 0.95 if we use the 90% confidence interval for the difference between the means, corresponding to a one-sided test situation [6]. The somewhat higher probability of the n = 5 curve in Figure 1A probably arises because the confidence interval, which is relatively wide in this case, sometimes crosses the lower specification limit as well. If we used the 80% confidence interval, the power function curves would lie at a somewhat lower level at any difference between the means, and the probability of true alarm would be 0.90. We used the 90% confidence interval, in accordance with others [7].

How can we use these power function curves? Suppose, for instance, that the specification limits are 0 ± 1 SDa, meaning that we accept the two analytical methods as equivalent if the 90% confidence interval for the difference between their means lies within minus 1 to plus 1 SDa. From Figure 1A we see that measuring the sample material 10 times with each method implies a probability of false alarm in excess of 0.6 even if the difference between the means is zero. If the difference is 0.5 SDa, i.e. halfway between zero and the upper specification limit, the probability of false alarm is nearly 0.8. In this case we should obviously decide on a larger number of measurements, for instance 40–50, to get the probability of false alarm down to a level of, say, 0.2 when the difference between the means is 0.5 SDa. However, that decision is just one of many we could take in this situation. There is no single, right answer to the question of sample size. The decision depends on the cost of measuring and our willingness to risk an alarm of nonequivalence. Of course, the difference between the means may really be within the specification limits in spite of a nonequivalence alarm; increasing the number of measurements to get a narrower confidence interval and recalculating is then an option. The power function curves do, however, provide the important information on the probability of false alarm, so that we can make our plans under less uncertainty. If the specification limits are 0 ± 2 SDa (Figure 1B) we might plan to do 10–15 measurements with each analytical method, and perhaps 5–10 measurements if the specification limits are 0 ± 3 SDa (Figure 1C).

A drawback of these curves may be the assumption of equal SDa for the two analytical methods. However, in many cases this assumption is not far-fetched, and the decision on sample size is a fairly rough estimate anyway. Also, we constructed a very limited number of power function curves. We think, however, that this limited number of curves is sufficient to demonstrate how the probability of getting a nonequivalence alarm varies with the difference between the means, with the acceptance criterion, and with the number of measurements with each method over a range that covers most scenarios in clinical chemistry.
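The worked example above can be made concrete with a short sweep over sample sizes, again using our own Monte Carlo illustration (the same estimate as in the Methods sketch, repeated here so the snippet runs on its own; the exact probabilities are approximate, not figures from the paper):

```python
# Sweep of sample sizes for the scenario discussed above: acceptance criterion
# 1 SDa, true difference 0.5 SDa (illustrative only).
import numpy as np
from scipy import stats

def p_alarm(n, true_diff, acceptance, reps=100_000, conf=0.90, seed=0):
    rng = np.random.default_rng(seed)
    t = stats.t.ppf(1.0 - (1.0 - conf) / 2.0, df=2 * n - 2)
    x = rng.normal(true_diff, 1.0, size=(reps, n))
    y = rng.normal(0.0, 1.0, size=(reps, n))
    diff = x.mean(axis=1) - y.mean(axis=1)
    se = np.sqrt((x.var(axis=1, ddof=1) + y.var(axis=1, ddof=1)) / n)
    return np.mean((diff - t * se < -acceptance) | (diff + t * se > acceptance))

for n in (10, 20, 40, 50):
    print(n, round(p_alarm(n, true_diff=0.5, acceptance=1.0), 2))
# Expected pattern: roughly 0.8 at n = 10, falling to about 0.2-0.3 at
# n = 40-50, in line with Figure 1A and the discussion above.
```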

Declaration of interest: The authors report no conflict of interest. The authors alone are responsible for the content and writing of the paper.

References

[1] Limentani GB, Ringo MC, Ye F, Berquist ML, McSorley EO. Beyond the t-test: statistical equivalence testing. Anal Chem 2005;77:221A–26A.
[2] Chatfield MJ, Borman PJ. Acceptance criteria for method equivalency assessments. Anal Chem 2009;81:9841–8.
[3] Åsberg A, Solem KB, Mikkelsen G. Allowable systematic difference between two instruments measuring the same analyte. Scand J Clin Lab Invest. Published Online First: 9 June 2014. doi: 10.3109/00365513.2014.921836.
[4] Zar JH. Biostatistical analysis, 5th ed. Upper Saddle River, NJ: Pearson Prentice Hall; 2010. p 133.
[5] Phillips KF. Power of the two one-sided tests procedure in bioequivalence. J Pharmacokinet Biopharm 1990;18:137–44.
[6] Åsberg A, Bolann B, Mikkelsen G. Using the confidence interval of the mean to detect systematic errors in one analytical run. Scand J Clin Lab Invest 2010;70:410–4.
[7] Borman PJ, Chatfield MJ, Damjanov I, Jackson P. Design and analysis of method equivalence studies. Anal Chem 2009;81:9849–57.
