Pharmacoepidemiology and Drug Safety 2015; 24: 504–509 Published online 10 March 2015 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/pds.3750

ORIGINAL REPORT

Missing laboratory test data in electronic general practice records: analysis of rheumatoid factor recording in the clinical practice research datalink

Cormac J. Sammon1*, Anne Miller2, Kamal R. Mahtani3, Tim A. Holt3, Neil J. McHugh1,4, Raashid A. Luqmani5 and Alison L. Nightingale1

1 Department of Pharmacy and Pharmacology, University of Bath, Claverton Down, Bath BA2 7AY, United Kingdom
2 Department of Rheumatology, Nuffield Orthopaedic Centre, Oxford University Hospitals NHS Trust, Oxford OX3 7HE, United Kingdom
3 Nuffield Department of Primary Care Health Sciences, Radcliffe Observatory Quarter, Woodstock Road, Oxford OX2 6GG, United Kingdom
4 Department of Rheumatology, Royal National Hospital for Rheumatic disease, Bath BA27AY, United Kingdom
5 NIHR Musculoskeletal Biomedical Research Unit, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Science, University of Oxford, United Kingdom

ABSTRACT

Purpose To investigate whether information from the literature could be used to identify periods of practice data in an electronic healthcare database during which rheumatoid factor (RF) test results are likely to be missing-not-at-random (MNAR).

Methods RF tests recorded in the Clinical Practice Research Datalink (CPRD) were identified and defined as having a positive, negative or missing result. The proportion of positive test results was then calculated based on (i) complete-case analysis, (ii) after restriction to tests from practice years with no missing test results and (iii) following multiple imputation of missing test results. The same three analyses were then carried out after excluding practice years with a proportion of positive tests incompatible with the missing completely at random (MCAR) assumption.

Results We identified 127 969 RF test records, 30.4% of which did not have an associated test result. Among tests with results available, 19% were positive. Both multiple imputation of the 38 867 missing test results and restriction of the study population to the 491 practice years with complete data had little impact on the percentage of positive tests. Following exclusion of the 544 practice years in which data were likely to be MNAR the percentage of positive tests in all analyses decreased to ~7%.

Conclusions Recording of RF tests and RF test results in the CPRD is incomplete, with data likely to be MNAR in many practices. Exclusion of practice years with a high proportion of positive tests brought the distribution of positive tests in the study in line with the literature.

Copyright © 2015 John Wiley & Sons, Ltd.

Key words: primary care; database; diagnostic accuracy; missing data; rheumatoid factor; pharmacoepidemiology

Received 29 April 2014; Revised 08 September 2014; Accepted 10 December 2014

*Correspondence to: C. Sammon, Department of Pharmacy and Pharmacology, University of Bath, Claverton Down, Bath BA2 7AY, United Kingdom. E-mail: [email protected]

INTRODUCTION

Primary care databases contain a large amount of data on diagnostic laboratory tests requested in UK general practice. These data could be of value in determining the diagnostic accuracy and economic value of these tests and, from a pharmacoepidemiological standpoint, may be used as markers of the presence, severity and progression of disease. Data on primary care databases are generated from records that have been used primarily for clinical rather than research purposes. Therefore, in the absence of automatic data entry of laboratory test results, decisions to record these results are likely to be dependent on perceived clinical importance. For example, normal test results may be deemed less clinically relevant and recorded less completely than abnormal results. Consequently, missing test data cannot be assumed to be

MISSING RHEUMATOID FACTOR TEST DATA IN THE CPRD

missing completely at random (MCAR) and are likely to be missing not at random (MNAR) with either the clinical codes recording the occurrence of a test (test records) or the numerical test results themselves (test results) being preferentially recorded if the results are abnormal. The potential MNAR nature of test data is a major obstacle to its use because it will introduce a selection bias which cannot be easily remedied using standard methods such as multiple imputation.1,2 Recent increases in the number of practices with automated transfer of test results directly from laboratories into patient records should reduce the proportion of test results that are entered based on subjective decisions regarding clinical relevance. As a result, the proportion of test data that are not MCAR is also likely to decrease in practices over time. Rheumatoid factor (RF) may be tested in UK primary care in patients with suspected rheumatoid arthritis (RA). Between 2003 and 2009 Miller et al. 3 collected complete RF test data from a laboratory serving 108 UK GP practices, observing a mean percentage of positive test results per practice of 6.3%. This is in line with other published studies of complete test data where on average 5–10% of RF tests conducted in the general population are positive.4–6 Based on the distribution of the percentage of positive test results observed per practice in the Miller et al. study (Figure 1), practices with a percentage of positive tests


greater than ~20% appear unlikely if data are complete or MCAR. The aim of this study was to investigate whether the above information on the expected proportion of positive tests in a practice could be used to identify and exclude potentially MNAR RF test data in the CPRD, producing a dataset in which RF test records and RF test results can be analysed assuming that data are either MCAR or MAR.

METHOD

Study population

The CPRD contains the anonymised primary care records of approximately 8.4% of the UK population. Patient data routinely available in the database include demographic details, diagnoses and symptoms, hospital referrals and admissions, immunisations, pregnancies, laboratory tests, prescriptions issued by the GP and deaths. The owners of the CPRD, the Medicines and Healthcare products Regulatory Agency, operate continuous quality control procedures. Those practices meeting a specific set of quality criteria are considered of a sufficient standard for research purposes.7 The medical records of all ‘acceptable’ patients who were permanently registered on the CPRD between 1 January 2000 and 31 December 2008 were searched for RF tests. All tests were identified irrespective of

Figure 1. Scatter plots and boxplots showing the relationship between the proportion of missing rheumatoid factor results among all tests (Mall) and the proportion of positive rheumatoid factor tests among all tests with a result (Prec) in practice years on the CPRD between 2000 and 2008. Results are shown for all practice years (1a, 1c) and for only those practice years with Prec < 20% (1b, 1d). The distribution of practice Prec observed in a smaller, published study in which no test data were missing (Mall = 0) is shown in a lighter colour for comparison


c. j. sammon et al.

whether there was a test value or normal range recorded against it. The first test per individual was included in the study population. First RF tests recorded in individuals aged [...] a Prec > 20% was used in this study as the cut-off to identify two potential missingness mechanisms:

• Among the practice years with Mall = 0% (i.e. there were no missing test values), a Prec > 20% was considered suggestive that codes recording the occurrence of an RF test were not MCAR.
• Among practice years with Mall > 0% (i.e. there were missing test values), a Prec > 20% was considered suggestive of RF test result records not being MCAR.
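The two rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the input layout (one `(practice, year, result)` tuple per first RF test), the function names and the label strings are assumptions made for the example. Mall and Prec follow the definitions in the Figure 1 caption, and the 20% cut-off is the one stated in the text.

```python
from collections import defaultdict

PREC_CUTOFF = 0.20  # cut-off stated in the text: Prec > 20%


def summarise_practice_years(tests):
    """Return {(practice, year): (m_all, p_rec)}.

    tests: iterable of (practice, year, result) tuples, where result is
    "positive", "negative" or None (test recorded without a result).
    This input layout is hypothetical, chosen for the example.
    """
    counts = defaultdict(lambda: {"positive": 0, "negative": 0, "missing": 0})
    for practice, year, result in tests:
        counts[(practice, year)][result or "missing"] += 1

    summary = {}
    for key, c in counts.items():
        recorded = c["positive"] + c["negative"]
        total = recorded + c["missing"]
        m_all = c["missing"] / total  # proportion of tests with no result
        # Prec is undefined when no test in the practice year has a result.
        p_rec = c["positive"] / recorded if recorded else None
        summary[key] = (m_all, p_rec)
    return summary


def classify(m_all, p_rec, cutoff=PREC_CUTOFF):
    """Apply the two missingness rules to one practice year."""
    if p_rec is None:
        return "Prec undefined"
    if p_rec <= cutoff:
        return "retain"
    if m_all == 0:
        return "test records possibly not MCAR"
    return "test results possibly not MCAR"


# Toy data, invented purely to exercise the rules.
toy_tests = (
    [("A", 2005, "positive")] + [("A", 2005, "negative")] * 9
    + [("B", 2005, "positive")] * 3 + [("B", 2005, "negative")] * 2
    + [("C", 2006, "positive")] * 2 + [("C", 2006, "negative")] * 2
    + [("C", 2006, None)] * 6
)
labels = {
    key: classify(m_all, p_rec)
    for key, (m_all, p_rec) in summarise_practice_years(toy_tests).items()
}
```

In this toy example practice year A/2005 is retained, B/2005 is flagged under the first rule (complete results, high Prec) and C/2006 under the second (missing results, high Prec).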

Analysis of restricted dataset

Practice years considered unlikely to be MCAR under the above definitions were excluded and a complete case analysis (complete case analysis 2), complete practice-year analysis (complete practice-year analysis 2) and multiple imputation analysis (multiple imputation analysis 2) were carried out on this restricted study population as described above. The proportions of positive and negative tests in each of these analyses were compared with each other and with the results obtained before exclusion of practice years with a Prec > 20%. Practices with [...] 0% had a Prec > 20%, indicating a high likelihood of the results of negative tests not being recorded.

Analysis of restricted dataset

The results obtained following exclusion of the 544 practice years with a Prec > 20% are shown in Table 1 (b). The percentages of positive tests in the complete case analysis (7.5%), complete practice-year analysis (7.1%) and multiple imputation analysis (7.6%) were similar but were considerably lower than those observed prior to exclusion of practice years with a Prec > 20%. The distribution of Prec values after exclusion was similar to that reported by Miller et al.3 (Figure 1b, 1d).
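As a rough illustration of how the pooled percentage of positive tests responds to the restriction, the sketch below computes the complete case and complete practice-year figures before and after dropping practice years with Prec > 20%. The per-practice-year counts are invented toy values and the helper names are assumptions, so the outputs are not the study's results. The multiple imputation analysis is deliberately left out.

```python
def pooled_percent_positive(practice_years):
    """Percentage of positive results among all recorded results, pooled."""
    pos = sum(py["positive"] for py in practice_years)
    neg = sum(py["negative"] for py in practice_years)
    return 100 * pos / (pos + neg)


def p_rec(py):
    """Proportion of positive tests among tests with a result."""
    return py["positive"] / (py["positive"] + py["negative"])


# Invented counts per practice year (not taken from the CPRD).
toy = [
    {"positive": 5, "negative": 95, "missing": 0},
    {"positive": 8, "negative": 92, "missing": 50},
    {"positive": 30, "negative": 20, "missing": 0},
    {"positive": 40, "negative": 10, "missing": 150},
]

# Before restriction: all recorded results, then only complete practice years.
complete_case_1 = pooled_percent_positive(toy)
complete_py_1 = pooled_percent_positive([py for py in toy if py["missing"] == 0])

# After restriction: drop practice years with Prec > 20%, then repeat.
restricted = [py for py in toy if p_rec(py) <= 0.20]
complete_case_2 = pooled_percent_positive(restricted)
complete_py_2 = pooled_percent_positive(
    [py for py in restricted if py["missing"] == 0]
)
```

With these toy counts the pooled percentage falls from roughly 28% to 6.5% in the complete case analysis, which mirrors the direction of the change reported in the text.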

DISCUSSION

According to our hypothesis, practice years with greater than 20% positive tests were unlikely to occur if data are MCAR. We found the recording of RF test records and test results on the CPRD to be consistent with data in many practice years being MAR or MNAR. Restriction of the study population to practice years with complete recording had little impact on the proportion of positive tests in the dataset, as did multiple imputation of missing test results, suggesting that the missing data are MNAR (given complete recording of test results and given our imputation model). Exclusion of all data from practice years with greater than 20% positive tests, including those practice years with complete test results, brought the proportion of positive tests in line with studies with complete data on RF results.3

The validity of our analysis hinges largely on the assumption that a Prec greater than 20% indicates that data are unlikely to be MCAR. In this regard, the similarity between the Prec distribution observed by Miller et al.3 and that observed in our data (after exclusion of practice years with a Prec > 20%) is reassuring (Figure 1b, 1d). Both our CPRD dataset and the Miller et al.3 study only included tests which were requested by a general practice; therefore, both studies are likely to include a similar population of individuals suspected of having rheumatoid arthritis. The Prec cut-off was set a little lower than the maximum Prec observed in the study by Miller et al.3 in an effort to maximise the exclusion of MNAR data, at the potential cost of excluding some practice years with data that are actually MCAR.

The application of multiple imputation had little impact before or after exclusion of practice years with a


Prec greater than 20%, indicating that the data were not MAR given our imputation model. This does not exclude the possibility that data would be MAR given an imputation model containing different variables; however, we believe that we have included the most relevant variables available in the CPRD in the imputation model.

While we were able to identify and exclude a large number of practices where RF test data are likely to be MNAR, we cannot be sure that test records and test results are MCAR in all the remaining practice years. The use of an even lower Prec cut-off might increase our sensitivity to identify MNAR data; however, it would also decrease both the specificity for MCAR data and the available study population. Furthermore, in practice years with missing test results (Mall > 0%) the distribution of positive to negative results was assumed to only be affected by missing test results. As we have shown for practice years where a result was recorded for 100% of tests (Mall = 0%), practices may also be affected by records of tests themselves being MNAR. It was not possible to investigate the interaction between test records being MNAR and test results being MNAR in practices with Mall > 0%.

Our exclusion of all test results from certain practice years means that our approach may not be suitable for studies which require serial laboratory measurements. An option in situations where serial measurements are needed might be to remove the results of all tests in practice years in which results are considered likely to be MNAR and then use an imputation method suitable for longitudinal data (such as the two-fold fully conditional specification algorithm8) to impute results back into the test records in these practice years.

Despite these limitations, our method represents a step forward from the inappropriate analysis of datasets containing data that are MNAR using complete case analysis, complete practice analysis, missing categories and multiple imputation.
Under an extreme MNAR assumption, imputation of all missing test results as negative results would be an alternative option to the restriction method applied above. However, while this would have produced a more plausible Prec distribution and overall proportion of positive tests (13%) than the complete case or multiple imputation analysis, it would not have been able to account for the high Prec observed in practices with complete results and would have been limited by the extremity of the assumption that every missing test was negative. Therefore while our method may not result in the complete removal of MNAR data from a study, it should substantially

reduce the amount of MNAR data relative to more naive methods while maintaining large, epidemiologically useful study populations. The dataset obtained in this study has been used to estimate the diagnostic accuracy of RF testing in general practice for rheumatoid arthritis.9 Clinically meaningful differences in diagnostic accuracy estimates were obtained before and after exclusion of practice years with a Prec > 20%. The approach we present here might be applied to a range of other clinical tests in this and other data sources. The relative simplicity of the approach should appeal as it only requires information on the expected ratio of positive to negative results in a practice. If such information is not available in the literature it might be collected through a pilot study, or even estimated using the data (e.g. using the interquartile range of Prec values in those with Mall = 0). The performance of different Prec cut-offs could also be investigated in sensitivity analyses.

In conclusion, recording of RF tests and RF test results in the CPRD is incomplete, with data likely to be MNAR in many practices. This finding should stimulate investigators embarking on a study using practice-based electronic healthcare databases to carefully consider whether any data they are using may be MNAR. If data are thought to be potentially MNAR, the exclusion method described herein may prove useful in developing an analysis-ready dataset.

CONFLICT OF INTEREST

The authors declare no conflict of interest.

KEY POINTS

• The ratio of positive to negative rheumatoid factor (RF) tests per practice year in the Clinical Practice Research Datalink is much higher than that reported in sources with no missing data, even after multiple imputation.
• This suggests that RF test data are missing not at random (MNAR); missing values and tests are more likely to be negative because a negative RF is of less clinical relevance.
• An expected distribution of positive tests per practice was identified from the literature; using this, practice years during which test data are likely to be MNAR were identified.
• Exclusion of data from such practice years brought the ratio of positive to negative test results in line with the literature.
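The extreme-MNAR alternative mentioned in the Discussion (treating every missing result as negative) can be checked approximately against the headline counts given in the abstract. The short calculation below is a back-of-envelope sketch: the variable names are illustrative, and because the 19% figure is rounded in the source, the number of positive tests is only an estimate.

```python
TOTAL_TESTS = 127_969          # RF test records (abstract)
MISSING_RESULTS = 38_867       # test records without a result (abstract)
PCT_POSITIVE_RECORDED = 19     # % positive among tests with a result (rounded)

recorded = TOTAL_TESTS - MISSING_RESULTS

# Estimated number of positive tests, limited by the rounding of the 19%.
est_positive = recorded * PCT_POSITIVE_RECORDED / 100

# Share of test records with no result.
pct_missing = 100 * MISSING_RESULTS / TOTAL_TESTS

# If every missing result were negative, positives are spread over all tests.
pct_positive_if_missing_negative = 100 * est_positive / TOTAL_TESTS

print(round(pct_missing, 1))                    # 30.4
print(round(pct_positive_if_missing_negative))  # 13
```

Both values agree with the figures quoted in the text (30.4% of test records without a result, and about 13% positive under the extreme assumption), and the latter remains above the ~7% obtained after excluding practice years with Prec > 20%.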




ETHICS STATEMENT

The study protocol was approved by the Independent Scientific Advisory Committee (ISAC) for the CPRD, protocol number 10_105R.

ACKNOWLEDGEMENTS

We would like to thank Corinne de Vries, Professor of Pharmacoepidemiology, for her invaluable discussion. This study is based in part on data from the Full Feature General Practice Research Database (now termed Clinical Practice Research Datalink) obtained under licence from the UK Medicines and Healthcare Products Regulatory Agency. However, the interpretation and conclusions contained in this study are those of the authors alone.

FUNDING

Access to the CPRD database was funded through the Medical Research Council’s licence agreement with MHRA. Further funding was provided by the Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford and the MSK Biomedical Research Unit, Oxford.

PRIOR POSTINGS

The work contained in this manuscript has been presented as a poster at the 2013 International


Conference of Pharmacoepidemiology and Therapeutic Risk Management in Montreal. The main set of empirical diagnostic accuracy results presented herein has been presented as a poster at the American College of Rheumatology Annual Meeting 2013 and has been included as part of a manuscript submitted for publication in a peer-reviewed clinical journal.

REFERENCES

1. Marston L, Carpenter JR, Walters KR, et al. Issues in multiple imputation of missing data for large general practice clinical databases. Pharmacoepidemiol Drug Saf 2010; 19(6): 618–26. doi:10.1002/pds.1934.
2. Sterne JA, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009; 338: b2393. doi:10.1136/bmj.b2393.
3. Miller A, Mahtani KR, Waterfield MA, et al. Is rheumatoid factor useful in primary care? A retrospective cross-sectional study. Clin Rheumatol 2013; 49(suppl 1): i68. doi:10.1007/s10067-013-2236-0.
4. Nielsen SF, Bojesen SE, Schnohr P, et al. Elevated rheumatoid factor and long term risk of rheumatoid arthritis: a prospective cohort study. BMJ 2012; 345: e5244. doi:10.1136/bmj.e5244.
5. van Schaardenburg D, Lagaay AM, Otten HG, et al. The relation between class-specific serum rheumatoid factors and age in the general population. Br J Rheumatol 1993; 32(7): 546–9.
6. Simard JF, Holmqvist M. Rheumatoid factor positivity in the general population. BMJ 2012; 345: e5841. doi:10.1136/bmj.e5841.
7. Williams T, van Staa T, Puri S, et al. Recent advances in the utility and use of the General Practice Research Database as an example of a UK Primary Care Data resource. Therapeutic Advances in Drug Safety 2012; 3(2): 89–99.
8. Welch CA, Petersen I, Bartlett JW, et al. Evaluation of two-fold fully conditional specification multiple imputation for longitudinal electronic health record data. Stat Med 2014 Apr 30. doi:10.1002/sim.6184.
9. Miller A, Nightingale AL, Sammon CS, et al. The diagnostic accuracy of rheumatoid factor testing in primary care. Arthritis & Rheumatism 2013; 65(S10): 964–5.
