Assessing Precision and Accuracy in Blood Gas Proficiency Testing 1-3

JAMES E. HANSEN, RICHARD CASABURI, ROBERT O. CRAPO, and ROBERT L. JENSEN

Introduction

Proficiency testing, using ampules of tonometered buffers, has been used on a regional and national scale for at least 10 yr (1-7). Until the present time, blood gas proficiency testing rating systems have been used to assess instrument-technician accuracy for three analytes: pH, Pco2, and Po2. The rating systems used have been independent for each analyte, as if the assessment of each analyte were unrelated (8-10). The problem of rating instrument-technician accuracy is complicated by at least four additional factors: (1) proficiency testing materials distributed in ampules are not identical to freshly tonometered blood (1-7), (2) mean values obtained by different instrument models for the same lot of proficiency testing materials consistently differ (2, 6, 7), (3) variances about the instrument-specific mean values of the many instrument models differ substantially (2, 6, 7), and (4) there is uncertainty as to what target and limiting values should be selected for rating systems (8-10).

Recently, one lot of proficiency testing materials was distributed by the American Thoracic Society-California Thoracic Society Blood Gas Proficiency Testing Survey (PTS) to more than 800 instruments three times within a 1-yr period. Thus, for this single lot, it was feasible: (1) to assess precision (reproducibility) of measurement over a 1-yr period for each analyte independent of accuracy (proximity to the correct or target value); (2) to determine whether there was a relationship between measures of accuracy and precision among the three analytes; and (3) to compare two possible measures of accuracy with measures of precision. The ultimate goal of this study is to see if a measure of precision might be devised that would improve the current assessment and rating system of instrument performance.


SUMMARY Blood gas proficiency testing has focused on assessing the accuracy of measurement of each analyte (pH, Pco2, and Po2) independently of each other. Recently, the American Thoracic Society-California Thoracic Society Blood Gas Proficiency Testing Survey distributed the same lot of ampules of proficiency testing material (a buffered fluorocarbon-containing emulsion) on three occasions within a 1-yr period, allowing us to assess the precision (reproducibility) of measurement of each analyte. Comparing 580 instruments of 13 models, we found that the precision of measurement of each analyte was positively correlated with the precision of measurement of each other analyte, and the correlation of precision between models was much stronger than precision between the individual instruments. We also found correlation of precision of each analyte with two targets for accuracy: (1) the all-instrument mean and (2) the model-specific means. Correlations were higher with the model-specific means. These findings suggest: (1) that features unique to design of each model are important in the precision of measurement of these ampules, and (2) that it would be informative to include measurements of precision with linked and cumulative ratings of analyte accuracy in proficiency testing rating systems. AM REV RESPIR DIS 1990; 141:1190-1193

Methods

Materials

Every 3 months the PTS distributes as unknowns three different lots of fluorocarbon-containing emulsion (FCE) proficiency testing material to more than 400 laboratories with greater than 800 registered blood gas instruments. During a recent calendar year, they distributed at three different times, with concealed identity codes, a single lot of ampules (Lot no. 17), which had been prepared all at one time. For each of these quarters, participating registrants submitted their assay values by instrument for each analyte (pH, Pco2, and Po2) for this lot as well as for the other two lots. Participants were unaware that ampules from a single lot would be distributed more than once.

General Assessment

We selected for analysis the results from all registered instruments that participated for each of the three quarters and for which there were 10 or more instruments of the same model. Thirteen models and 583 instruments were represented. First, we calculated the overall mean and SD value for each analyte from the combined data of all three quarters. Second, we screened the data and excluded from further analyses the three instruments with values for one quarter for each analyte that exceeded 3 SD from the overall means (implying transcription or gross analytic errors). We then recalculated the mean and SD for each analyte by quarter.
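The screening step above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the function names, the dictionary layout, and the pH readings are hypothetical, and it assumes an instrument is dropped if any one of its quarterly values lies more than 3 SD from the pooled mean.

```python
from statistics import mean, stdev


def screen_instruments(readings, n_sd=3.0):
    """Keep instruments whose quarterly values all lie within n_sd pooled SDs.

    readings maps an instrument id to its list of quarterly values for one
    analyte (hypothetical layout, chosen for illustration).
    """
    pooled = [v for values in readings.values() for v in values]
    overall_mean = mean(pooled)
    overall_sd = stdev(pooled)  # sample SD of the combined quarters
    return {
        inst: values
        for inst, values in readings.items()
        if all(abs(v - overall_mean) <= n_sd * overall_sd for v in values)
    }


def quarterly_summary(readings):
    """Return a (mean, SD) pair for each quarter across instruments."""
    return [(mean(q), stdev(q)) for q in zip(*readings.values())]


# Hypothetical pH readings; inst07 carries a transposed-digit error.
readings = {
    "inst01": [7.350, 7.352, 7.351],
    "inst02": [7.353, 7.351, 7.349],
    "inst03": [7.348, 7.350, 7.352],
    "inst04": [7.351, 7.353, 7.350],
    "inst05": [7.352, 7.349, 7.351],
    "inst06": [7.349, 7.351, 7.353],
    "inst07": [7.351, 7.530, 7.352],
}

kept = screen_instruments(readings)
summary = quarterly_summary(kept)  # recalculated after exclusion
```

Note that the outlier itself inflates the pooled SD used for screening, so with very few readings a single gross error may not reach the 3 SD threshold; with several hundred instruments, as in the survey, this effect is small.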

(Received in original form August 7, 1989 and in revised form October 30, 1989)
1 From the Department of Medicine, UCLA School of Medicine, Division of Respiratory and Critical Care Physiology and Medicine, Harbor-UCLA Medical Center, Torrance, California; and the Department of Medicine, LDS Hospital, and University of Utah College of Medicine, Salt Lake City, Utah.
2 Supported by a grant from the American Thoracic Society and the California Thoracic Society Blood Gas Proficiency Testing Programs.
3 Correspondence and requests for reprints should be addressed to James E. Hansen, Harbor-UCLA Medical Center, 1000 W. Carson St., Box 24, Torrance, CA 90509.

Precision Assessment

After this, we calculated, for each analyte, the precision of measurement of each individual instrument (the customary standard deviation of three measures by each instrument multiplied by 1.128 to give an unbiased estimate of the SD) (11) and the average precision of all of the instruments of each model (sum of the unbiased SD values of all of the instruments of a given model divided by the number of instruments of that model). These measures allowed us to compare precision between individual instruments and between models of instruments. For example, the instrument that had the smallest SD for that analyte was the most precise instrument for that analyte, whereas the one with the largest SD was the least precise. Similarly, the model with the smallest average SD was the most precise model, whereas that with the largest SD was the least precise.

Next we calculated the three Pearson's correlation coefficients (r) for precision of measurement between analytes (i.e., pH versus Pco2, pH versus Po2, and Pco2 versus Po2) comparing (1) all 580 instruments using the individual instrument precision and (2) all 13 models, using the average precision of each of the 13 models of instruments (11). Thus, the r would be positive if a model or individual instrument that measured pH more precisely also, on average, measured Pco2 more precisely. To ascertain whether or not the significant correlations for the models might be unduly influenced by a single model, we used a "jackknife" statistical technique, which excludes each model one at a time and then recalculates correlation coefficients for the other 12 models (12).
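The precision measures and the jackknife described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the multiplier 1.128 is taken to be the reciprocal of the standard small-sample correction constant c4 for n = 3 normally distributed observations, all function names are hypothetical, and the model-average SD values are invented for the example.

```python
import math
from statistics import mean, stdev


def c4(n):
    """Bias-correction constant for the sample SD of n normal observations."""
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)


def unbiased_sd(values):
    """Sample SD divided by c4; for three values this multiplies by about 1.128."""
    return stdev(values) / c4(len(values))


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def jackknife_r(x, y):
    """Leave-one-out correlations: drop each pair in turn and recompute r."""
    return [
        pearson_r(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        for i in range(len(x))
    ]


# Precision of one hypothetical instrument from its three quarterly pH values.
instrument_sd = unbiased_sd([7.350, 7.352, 7.354])

# Hypothetical model-average SDs (not the survey data), one entry per model.
model_pco2_sd = [1.0, 1.2, 0.9, 1.5, 2.0, 1.1]
model_po2_sd = [2.0, 2.5, 1.8, 3.0, 4.2, 2.1]

r_all = pearson_r(model_pco2_sd, model_po2_sd)
r_leave_one_out = jackknife_r(model_pco2_sd, model_po2_sd)
```

If the leave-one-out values stay close to the full-sample r, no single model is driving the correlation, which is the question the jackknife is meant to answer.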

Accuracy Assessment

To assess accuracy of measurement for each analyte, we calculated the absolute deviation of the mean of the three values of each instrument from two different target values: Method A, absolute deviations from all-instrument mean values, and Method B, absolute deviations from the model-specific mean values for that instrument. As will be seen later (table 2), the model-specific means for each analyte always differed from the all-instrument mean. Thus, for pH, instrument X might be considered less accurate than instrument Y using the all-instrument mean and more accurate than instrument Y using the model-specific mean. Using each instrument's deviations from the appropriate target value, we calculated Pearson's correlation coefficients for accuracy of pH versus Pco2, accuracy of pH versus Po2, and accuracy of Pco2 versus Po2 for Method A and then for Method B.

Precision versus Accuracy Assessment

Finally, for each analyte, using the SD as the measure for precision and the absolute deviation of the mean from the target value (for both Methods A and B) as the measure for accuracy, we calculated Pearson's correlation coefficients (for both Methods A and B) using all instruments. For all correlations, we noted significance values.
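A small worked example may help show how the two accuracy targets can rank the same pair of instruments in opposite order. The sketch below is illustrative only: the function name, the record layout, and the four pH instruments (two hypothetical models, M1 and M2) are assumptions, not survey data.

```python
from statistics import mean


def accuracy_deviations(instruments):
    """Return (method_a, method_b) absolute deviations for each instrument.

    Method A uses the all-instrument mean as target; Method B uses the mean
    of the instrument's own model. Each record is a dict with hypothetical
    keys 'id', 'model', and 'values' (the three quarterly readings).
    """
    inst_means = [mean(rec["values"]) for rec in instruments]
    all_mean = mean(inst_means)

    by_model = {}
    for rec, m in zip(instruments, inst_means):
        by_model.setdefault(rec["model"], []).append(m)
    model_mean = {model: mean(ms) for model, ms in by_model.items()}

    method_a = [abs(m - all_mean) for m in inst_means]
    method_b = [
        abs(m - model_mean[rec["model"]])
        for rec, m in zip(instruments, inst_means)
    ]
    return method_a, method_b


# Hypothetical pH readings, held constant across quarters for simplicity.
instruments = [
    {"id": "X", "model": "M2", "values": [7.340, 7.340, 7.340]},
    {"id": "Y", "model": "M1", "values": [7.350, 7.350, 7.350]},
    {"id": "P", "model": "M1", "values": [7.366, 7.366, 7.366]},
    {"id": "Q", "model": "M2", "values": [7.336, 7.336, 7.336]},
]

dev_a, dev_b = accuracy_deviations(instruments)
# X is farther than Y from the all-instrument mean (less accurate, Method A)
# but closer than Y to its own model mean (more accurate, Method B).
```

Here the all-instrument mean is 7.348 and the model means are 7.358 (M1) and 7.338 (M2), so X deviates by about 0.008 under Method A and 0.002 under Method B, while Y shows the reverse.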

Results

General

The mean values of pH (7.353), Pco2 (48.3 mm Hg), and Po2 (69.6 mm Hg) were similar from quarter to quarter, with instrument variability similar to those of other lots measured in the surveys during the same time period (table 1). This indicates that the lot was stable over time and representative of other lots used in proficiency testing surveys. The ranges of the model-specific means were 7.333 to 7.358 units for pH, 45.5 to 49.9 mm Hg for Pco2, and 67.0 to 75.9 mm Hg for Po2 (table 2), indicating the diverse response of the models. The average SD for each model (table 2) was almost always less than the overall SD values found in table 1.

Precision

Precision of measurement (SD) of each analyte was significantly and positively correlated with the precision of measurement of the other analytes (table 3). The correlation coefficients were generally higher when models of instruments rather than individual instruments were compared. As each single model was removed from the statistical analysis one by one, the r values remained high; in the case of Pco2 versus Po2, the 12 r values averaged 0.616 and ranged from 0.475 to 0.816. This indicates that this overall strong correlation for precision was not unduly affected by any single datum point; figure 1 illustrates the between-model relationship of precision of Po2 and Pco2 measurements.

TABLE 1
ALL INSTRUMENT (n = 580) SURVEY VALUES BY QUARTER*

Quarter    pH (units)       Pco2 (mm Hg)    Po2 (mm Hg)
1          7.355 ± 0.012    48.2 ± 1.n      69.5 ± 4.54
2          7.353 ± 0.013    48.3 ± 2.05     70.0 ± 6.49
3          7.351 ± 0.014    48.4 ± 2.09     69.1 ± 3.96

* Values are mean ± SD.

TABLE 2
MODEL-SPECIFIC MEAN VALUES AND STANDARD DEVIATIONS FOR THREE QUARTERS OF TESTING

                             Mean                      SD†
Model*             Number    pH      Pco2    Po2       pH      Pco2    Po2
CO168              71        7.358   48.1    69.2      0.012   1.35    2.96
CO170              27        7.356   49.9    67.0      0.008   1.10    2.09
CO175              27        7.355   47.8    68.9      0.006   1.12    1.85
CO178              189       7.358   49.6    67.2      0.007   1.12    1.99
IL813              27        7.356   47.6    68.2      0.009   1.39    1.67
IL1301             24        7.337   47.1    69.8      0.011   1.01    1.20
IL1303             17        7.340   47.5    69.7      0.008   1.05    1.27
IL1306             21        7.338   48.6    70.3      0.006   0.n     1.13
IL1312             26        7.333   48.7    71.0      0.009   0.77    1.78
RABL30             25        7.356   47.6    72.0      0.008   1.05    4.61
RABL2              42        7.346   46.2    75.9      0.006   1.74    4.01
RABL3              71        7.354   47.5    72.3      0.006   1.05    3.29
RBMS3              13        7.357   45.5    71.1      0.018   2.34    4.32

Mean, 13 models              7.349   47.8    70.2
All instruments              7.353   48.3    69.6

* Model prefixes: CO = Corning; IL = Instrumentation Laboratories; R = Radiometer.
† SD is average SD of each instrument of that model for three quarters.

TABLE 3
CORRELATIONS (r) OF STANDARD DEVIATION (PRECISION) BETWEEN ANALYTES IN QUARTERLY SURVEYS WITH THE SAME LOT OF FLUOROCARBON-CONTAINING EMULSION*

                      580 Individual Instruments    13 Models
Comparison            r        p Value              r        p Value
pH versus Pco2        0.283    < 0.0005             0.669    < 0.01
pH versus Po2         0.363    < 0.0005             0.335    > 0.05
Pco2 versus Po2       0.226    < 0.0005             0.621    < 0.025

* Calculations described in Precision Assessment section of METHODS.
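As a consistency check on table 2, the model-specific means weighted by the number of instruments of each model should reproduce the all-instrument means reported there (7.353, 48.3, and 69.6). The sketch below assumes the table 2 columns are read in the order in which they appear (one value per model, 13 models); the variable and function names are hypothetical.

```python
# Instrument counts and model-specific means, in the model order of table 2.
counts = [71, 27, 27, 189, 27, 24, 17, 21, 26, 25, 42, 71, 13]
ph_means = [7.358, 7.356, 7.355, 7.358, 7.356, 7.337, 7.340,
            7.338, 7.333, 7.356, 7.346, 7.354, 7.357]
pco2_means = [48.1, 49.9, 47.8, 49.6, 47.6, 47.1, 47.5,
              48.6, 48.7, 47.6, 46.2, 47.5, 45.5]
po2_means = [69.2, 67.0, 68.9, 67.2, 68.2, 69.8, 69.7,
             70.3, 71.0, 72.0, 75.9, 72.3, 71.1]


def weighted_mean(values, weights):
    """Mean of values weighted by the number of instruments per model."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


total_instruments = sum(counts)
all_ph = weighted_mean(ph_means, counts)
all_pco2 = weighted_mean(pco2_means, counts)
all_po2 = weighted_mean(po2_means, counts)
```

The counts sum to the 580 instruments analyzed, and the weighted means round to the all-instrument row of table 2, which supports this reading of the table columns.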
