
Role of data warehousing in healthcare epidemiology

D. Wyllie a,*, J. Davies b

a Public Health England Academic Collaborating Centre, John Radcliffe Hospital, Oxford, UK
b Oxford NIHR BRC Informatics Programme, Department of Computer Science, University of Oxford, Oxford, UK

* Corresponding author. Address: Public Health England Academic Collaborating Centre, John Radcliffe Hospital, Headley Way, Oxford OX3 9DU, UK. Tel.: +44 (0)1865 220860. E-mail address: [email protected] (D. Wyllie).

ARTICLE INFO

Article history: Received 25 November 2014; Accepted 6 January 2015; Available online 29 January 2015.

Keywords: Data management; Epidemiology; Healthcare data

SUMMARY

Electronic storage of healthcare data, including individual-level risk factors for both infectious and other diseases, is increasing. These data can be integrated at hospital, regional and national levels. Data sources that contain risk factor and outcome information for a wide range of conditions offer the potential for efficient epidemiological analysis of multiple diseases. Opportunities may also arise for monitoring healthcare processes. Integrating diverse data sources presents epidemiological, practical, and ethical challenges. For example, diagnostic criteria, outcome definitions, and ascertainment methods may differ across the data sources. Data volumes may be very large, requiring sophisticated computing technology. Given the large populations involved, perhaps the most challenging aspect is how informed consent can be obtained for the development of integrated databases, particularly when it is not easy to demonstrate their potential. In this article, we discuss the successes and challenges of recent projects as well as the potential of data warehousing for antimicrobial resistance monitoring.

© 2015 Published by Elsevier Ltd on behalf of the Healthcare Infection Society.

Introduction

Healthcare epidemiology, a branch of epidemiology concerned with the detection, control, and prevention of adverse events in the health economy, has gained prominence in recent years.1 This is attributable partly to a desire to learn more about the determinants of morbidity, mortality and cost in modern healthcare, and partly to an expectation that continuous quality monitoring and benchmarking can be built into efficient management systems.2

Data warehousing is a process by which information can be shared efficiently; the level at which it is shared may be an organization, a region, a network, or a nation.3 Information in the data warehouse may be in the form of a single definitive record, as occurs with integrated electronic patient care

systems in some hospitals. This 'top down' strategy has many attractions, but can be hard to implement and may be impractical when different systems play key roles that are not available in a core system.3 Another common scenario uses multiple independent systems: the problem posed by systems that cannot readily exchange information with each other is resolved by having the different systems contribute instead to a common 'information pool', thus allowing access to data across the organization as a whole. Inconsistencies may exist between data received from the different systems, but these can be resolved in order to generate a consistent 'single source of data' to inform policy making,3 as sketched below. This latter scenario is widespread in hospitals, where dozens or hundreds of independent systems may be in use.4,5 Some of these systems may contain very large amounts of information, such as databases containing laboratory tests or those tracking patient movements, whereas others are restricted to much smaller patient populations. The quality of these small data sets, sometimes called 'long tail data', may be very high, but they may be ignored in data warehouses owing to the cost of integrating them.6
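To make the 'information pool' idea concrete, the sketch below (Python, with entirely hypothetical field names and records) merges extracts for one patient from two independent feeder systems and resolves a conflicting field by preferring the most recently updated source. This is only one of many possible conflict-resolution rules; real data warehouses typically also need record linkage of the kind described in reference 5.

```python
from datetime import date

# Hypothetical extracts from two independent feeder systems.
# Field names and values are illustrative only.
lab_system = {
    "patient_id": "P001",
    "date_of_birth": date(1954, 3, 2),
    "updated": date(2014, 11, 1),
}
admission_system = {
    "patient_id": "P001",
    "date_of_birth": date(1954, 3, 12),   # conflicts with the laboratory record
    "ward": "EAU",
    "updated": date(2014, 12, 5),
}

def merge_records(records):
    """Combine records for one patient, resolving conflicts by
    preferring the value from the most recently updated source."""
    merged = {}
    for record in sorted(records, key=lambda r: r["updated"]):
        for field, value in record.items():
            if field != "updated":
                merged[field] = value   # later sources overwrite earlier ones
    return merged

single_source = merge_records([lab_system, admission_system])
print(single_source)
# {'patient_id': 'P001', 'date_of_birth': datetime.date(1954, 3, 12), 'ward': 'EAU'}
```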



As in other fields, the information generated in healthcare is increasing rapidly. Widespread use of electronic patient records, routine recording of free-text communications (such as discharge summaries), real-time patient tracking, continuous monitoring of patient physiology, cross-sectional imaging, telehealth, and human and microbial genomics are all contributing. Integrating these data to create what is called 'big data' presents technical and strategic challenges; if successful, the integrated data will relate both to outcomes of interest (such as mortality, complications, or microbial spread) and to a wide range of risk factors. 'Big data' refers to electronic health data sets so large and complex that they are difficult, or impossible, to manage with traditional software, hardware, and data management tools and methods.7 The range of problems with big data has been expressed as 'the four Vs': velocity, variety, volume, and validity.7 These problems apply to a large extent to all aspects of data warehousing. In this article, we illustrate problems and solutions associated with each of them, using examples from healthcare epidemiology. We also discuss the ethical and social issues associated with large-scale information integration, as well as the feasibility of using big data for day-to-day monitoring of antimicrobial resistance.

Successes and challenges in data warehousing for healthcare epidemiology

Velocity

The first challenge concerns the rate at which data are accrued and the interval between accrual and analysis. Although many decisions (such as those about antibiotic policy) are made on the basis of historical data sets that may have been gathered some time before analysis, other problems require much more rapid information synthesis and reporting, which may in turn require particular kinds of storage arrangements. An example of 'high speed' data is the use of emergency room monitoring to detect clinical deterioration, integrating data from multiple sources to produce a single measure of the need for increased care.8 Velocity is also important in syndromic surveillance systems designed to detect clusters indicative of point-source outbreaks or deliberate pathogen release.9
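As an illustration of such data fusion, the minimal sketch below combines several vital-sign streams into a single composite score and flags a possible need for escalation. The thresholds, weights and alerting rule are invented for illustration and are not those of the published system cited above.

```python
def component_score(value, low, high):
    """Score 0 when a vital sign is inside its illustrative normal range,
    rising to 2 the further outside it lies. Thresholds are hypothetical."""
    if low <= value <= high:
        return 0
    distance = (low - value) if value < low else (value - high)
    return 1 if distance <= (high - low) * 0.1 else 2

def fused_score(heart_rate, resp_rate, systolic_bp, spo2):
    """Fuse several streams into one measure of the need for increased care."""
    return (
        component_score(heart_rate, 50, 100)
        + component_score(resp_rate, 12, 20)
        + component_score(systolic_bp, 100, 160)
        + component_score(spo2, 94, 100)
    )

# Example observation set from an emergency department monitor (illustrative).
score = fused_score(heart_rate=118, resp_rate=26, systolic_bp=92, spo2=91)
if score >= 4:  # hypothetical escalation threshold
    print(f"Composite score {score}: consider escalation of care")
```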

Variety

A second challenge concerns the diversity of data sources used to assess the probability of an event of interest. One example is found in the use of research articles, satellite data on climate, historical and contemporary laboratory reports, and crowd-sourced reports of diagnoses to produce maps of disease.10 These data are highly disparate but, by combining them, useful inferences may be drawn. Another example concerns post-marketing surveillance of pharmaceuticals, where multiple data sources are integrated to inform assessments of drug safety.11
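One simple way to combine such disparate evidence, shown purely as an illustration (the weighting scheme and data are hypothetical and are not the method of the cited disease-mapping work), is to score each location on the evidence available from each source type and sum the weighted scores.

```python
# Hypothetical evidence for two locations from four disparate source types.
# A value of 1 means that the source reports the disease at that location.
evidence = {
    "district_A": {"literature": 1, "climate_suitability": 1, "lab_reports": 0, "crowd_sourced": 1},
    "district_B": {"literature": 0, "climate_suitability": 1, "lab_reports": 0, "crowd_sourced": 0},
}

# Illustrative weights reflecting how much trust each source type is given.
weights = {"literature": 0.4, "climate_suitability": 0.2, "lab_reports": 0.3, "crowd_sourced": 0.1}

def location_score(sources):
    """Weighted sum of evidence; higher scores suggest likelier presence."""
    return sum(weights[name] * present for name, present in sources.items())

for district, sources in evidence.items():
    print(district, round(location_score(sources), 2))
# district_A 0.7
# district_B 0.2
```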

Volume

The third issue concerns data volume. A large UK hospital with coverage of about 0.7% of England has about 2 TB (terabytes) of data stored in relational databases, excluding radiology and genetic data, whereas the US Kaiser Permanente health system stores about 5000 times more data, in excess of 10 PB (petabytes).5,7 Large databases are also being assembled in Europe, for example by the English National Health Service's Health and Social Care Information Centre (http://www.hscic.gov.uk), and for microbial surveillance initiatives by Public Health England. Even the very large databases accrued in healthcare are small compared with data volumes generated in the physical and astronomical sciences.12 The increasing volume of data stored obviously has an impact on the hardware and software required, leading to the emergence of new technical solutions.7 Nevertheless, there is a substantial cost to this, which can run into tens or hundreds of thousands of euros (€) per annum.

The key advantages of large data volumes include wide area coverage, and the ability to detect small effects on rare outcomes. A study by Freemantle et al. exemplifies this. They investigated mortality at weekends in admissions to all English hospitals, and in a group of 254 US hospitals.13 Admission at the weekend was associated with a significant increase in mortality that could not be explained by altered case-mix in the data sets available. These results have prompted a review of care delivery models in England. Another example is provided by the work of Shorr et al.14 Analysing 62 US hospitals over a four-year period, they identified 5975 patients with a clinical diagnosis of pneumonia that was supported by laboratory evidence. Of this cohort, 837 (14%) had pneumonia due to meticillin-resistant Staphylococcus aureus (MRSA). Even though this represents fewer than four cases per hospital per annum, the authors developed and validated a risk score for MRSA pneumonia, which they suggested might be used to restrict the use of anti-MRSA antimicrobial agents in high-risk areas. It seems unlikely that such a study could have been performed without use of a wide-area database.

A third example using very large data sets is the detection of rare adverse events following drug or vaccine licensing. The rationale is that the impact of the drug or vaccine after licensing may differ from that before licensing, or that rare side-effects may remain undetected before licensing. This requires analysis of diverse data sources.11
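The statistical value of volume can be illustrated with a toy calculation (all counts are invented and are not those of the studies cited above): the same modest odds ratio that cannot be distinguished from chance in a single hospital becomes clearly detectable once admissions are pooled across a whole system.

```python
import math

def odds_ratio_ci(a, b, c, d):
    """Crude odds ratio and Wald 95% CI for a 2x2 table:
    a/b = deaths/survivors among weekend admissions,
    c/d = deaths/survivors among weekday admissions."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - 1.96 * se)
    upper = math.exp(math.log(or_) + 1.96 * se)
    return or_, lower, upper

# Invented counts: one hospital versus a pooled national data set.
print(odds_ratio_ci(a=22, b=1978, c=80, d=7920))          # single centre: CI spans 1
print(odds_ratio_ci(a=2200, b=197800, c=8000, d=792000))  # pooled: CI excludes 1
```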

Validity

Are the data in the warehouse accurate enough for the intended use? Factors compromising validity include the various kinds of bias known to affect epidemiological studies, such as selection bias and misclassification bias, arising from problems with the information used to identify the patients, the outcomes and the covariates of interest.15 A contemporary example illustrating validity problems is provided by Google Flu Trends, an algorithm which used search terms entered into Google to predict influenza incidence.16 Search terms associated with influenza incidence, as determined by the US Centers for Disease Control and Prevention laboratory-based surveillance system, were identified and a model built that predicted influenza incidence from these data. After an initial period of success, the model substantially overestimated influenza incidence. Two factors may have been responsible for this decline in performance: a period of public concern about influenza; and an alteration in the way that Google 'suggests' search terms to its users.16 This emphasizes the importance, for valid epidemiological

inferences from large data sources, of continuously monitoring the performance of the multiple data sources relative to each other.
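A minimal sketch of such monitoring is given below, with invented weekly incidence figures and a hypothetical alerting rule: the big-data estimate is compared with an independent laboratory-based reference series, and sustained divergence triggers a prompt to recalibrate.

```python
def relative_error(estimate, reference):
    """Fractional difference between the big-data estimate and the reference."""
    return (estimate - reference) / reference

# Weekly influenza incidence per 100,000: big-data estimate versus
# laboratory-based reference surveillance (all values invented).
estimates  = [11.0, 14.5, 20.0, 31.0, 44.0]
references = [10.5, 13.8, 15.2, 18.0, 20.1]

# Hypothetical rule: flag if the estimate has exceeded the reference
# by more than 50% for two consecutive weeks.
errors = [relative_error(e, r) for e, r in zip(estimates, references)]
for week in range(1, len(errors)):
    if errors[week - 1] > 0.5 and errors[week] > 0.5:
        print(f"Week {week + 1}: sustained over-estimation, recalibrate model")
        break
```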

Ethical and social issues

Large databases may contain both exposures (such as smoking) and outcomes (such as death from lung cancer). Increasingly, they may also contain a large number of other exposures and outcomes, biomarker concentrations, information about individuals' positions in social networks, genetic sequence, gene expression data, and so on. All of these may have been generated for the purposes of healthcare, and much scientific value may derive from analysis of combinations of these variables. In this respect, large healthcare databases differ fundamentally from pre-specified, narrowly focused studies such as the famous cohort study of UK doctors demonstrating the impact of smoking on lung cancer, although they could be used to reproduce the Doll study.17 The key differences are that most of the content of large databases is multi-purpose, and that most or all of it may lack specific individual-level consent for analysis. Whereas a few custom resources, such as the UK Biobank, explicitly obtained consent for unplanned analyses, in most situations (e.g. performing a Google search, providing data to a general practitioner, or using Facebook) consent processes may be less clear.18 Recognizing the immense benefits that may accrue from such integration, the NHS is currently working to develop effective mechanisms for public consultation on large-scale integration of primary and secondary healthcare data across England (http://www.hscic.gov.uk/gpes/caredata). It has been argued that similar large-scale integration would also be valuable in the USA.19,20

Early experiments in the release of 'big data' to the public, motivated by the idea that wide access would speed innovation, led to the identification of individuals, despite the belief that the data had been 'anonymized'.21 Allowing such unfettered access is common practice for novel genomic sequence and microarray data sets, and has had some notable successes.22 But it is now well recognized that, as the number of data sources and their richness increase, it becomes increasingly difficult to guarantee that identification will not occur. In the absence of regulation, the possibility remains that malign individuals might apply techniques, perhaps involving combinations of other data sources, that enable unacceptable re-identification of data.21,23,24

In view of this, multiple organizations have put in place mechanisms to monitor access to large data resources. Typically, a data access committee vets applicants before releasing, to those judged to be trustworthy, information sufficient for a pre-specified project. Although this model has been criticized, perhaps because it restricts the ability to reproduce published research, it is becoming the standard mechanism for release of information in many settings.16 Mindful that maintenance of public trust in the process is paramount, some organizations, including the Genomics England sequencing initiative and the OCTOPUS drug safety monitoring project, have put in place additional restrictions, such as requiring that analyses are performed within a monitored computer environment.11 Multiple projects appear to have arrived independently at a small set of solutions with similar goals and principles:


- Maximizing the scientific benefit from the generosity of the 'data donors', while minimizing the likelihood of their re-identification;
- Ensuring public engagement and the opportunity to opt out, or, if that is impossible, arranging approval via government organizations, which can be held democratically accountable;
- Reducing the amount of data released to that necessary for the project;
- Accepting that anonymization is a relative, not an absolute, concept;
- Putting in place mechanisms for vetting users before entrusting data to them.

This process requires scientific, and sometimes technical, approval and monitoring of the work performed. In general, it precludes unconditional release of some types of data.

Data warehousing and the control of antimicrobial resistance

Sophisticated information systems could become a critical component of a system for monitoring antimicrobial resistance and its impact, thus delivering society-wide 'information for action'. Even for a single indication, a range of different antimicrobial policies is implemented within the UK, and heterogeneity in antimicrobial prescribing exists across the UK and Europe.25–27 Given this heterogeneity, large-scale observational epidemiology should be able to detect any associations that might exist between antimicrobial exposure and a wide variety of outcomes, including:

- treatment failure, which is commonplace and could be measured either by monitoring return to medical attention, or by patient self-reporting;28
- isolation of resistant organisms from individuals after treatment, determined from subsequent microbiological samples;
- spread of resistant organisms among the close contacts of treated individuals (e.g. Cooper et al.29).

Such a system would be analogous to a post-marketing surveillance system.
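For the second of these outcomes, the sketch below (a hypothetical record layout and a hypothetical 90-day follow-up window) links each treatment episode to later microbiological samples from the same patient and counts, per antibiotic, how often a subsequently isolated organism was resistant to the agent given.

```python
from datetime import date

# Invented records: antibiotic courses and later microbiology results.
prescriptions = [
    {"patient": "P1", "antibiotic": "trimethoprim", "start": date(2014, 1, 10)},
    {"patient": "P2", "antibiotic": "nitrofurantoin", "start": date(2014, 2, 3)},
]
isolates = [
    {"patient": "P1", "sampled": date(2014, 3, 1), "resistant_to": {"trimethoprim"}},
    {"patient": "P2", "sampled": date(2014, 6, 20), "resistant_to": set()},
]

def resistant_after_treatment(prescriptions, isolates, window_days=90):
    """Count, per antibiotic, treated patients with a later isolate
    resistant to that antibiotic within the follow-up window."""
    counts = {}
    for rx in prescriptions:
        for iso in isolates:
            days = (iso["sampled"] - rx["start"]).days
            if (iso["patient"] == rx["patient"]
                    and 0 < days <= window_days
                    and rx["antibiotic"] in iso["resistant_to"]):
                counts[rx["antibiotic"]] = counts.get(rx["antibiotic"], 0) + 1
    return counts

print(resistant_after_treatment(prescriptions, isolates))
# {'trimethoprim': 1}
```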

Using knowledge gained from antimicrobial surveillance

Would the effort entailed be worthwhile? What will we do if we find an association between use of a particular antibiotic and an adverse outcome, or with the spread of a highly multiresistant clone? What is the likelihood that it is a false discovery? Will it be generalizable to other populations? Is incomplete or changing acquisition of outcome or antibiotic resistance data responsible? Will it really make any difference?

We can have some confidence that we would be able to answer most of these questions. If the data set is large enough, we can examine generalizability across regions and subgroups. We may be able to use a conventional surveillance scheme to triangulate our results, as was done with Google Flu Trends. Concerns about data quality can be addressed by carefully designed prospective studies with enhanced data collection. There may be existing, high-quality 'long tail' data sources that could assist us.


Additionally, some questions may be addressed by emerging technologies, such as analysis of microbial genome sequences. Although we ultimately have to decide what price we put on better information, can we afford not to improve our monitoring of antibiotic resistance? It would be considered negligent to run an auto-analyser without ongoing quality monitoring and calibration. It is considered essential that individual practitioners take part in external quality assessment schemes and, at least in some countries, that hospitals embed quality monitoring and benchmarking systems in their management. For mass gatherings it is also considered essential to establish a surveillance system capable of detecting outbreaks.9 Regarding biosurveillance at the 2012 London Olympics, it was noted that 'Although the absolute risk of health-protection problems, including infectious diseases, ... is small, the need for reassurance of the absence of problems is higher than has previously been considered.'9 Given the known risks to individuals and communities of existing patterns of usage, this argument applies equally well to antimicrobial prescribing. Is it acceptable that we continue to use antimicrobials without effectively monitoring their impact?

Conflict of interest statement
None declared.

Funding
The research was supported by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre based at Oxford University Hospitals NHS Trust and University of Oxford. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

References

1. Hota B. Informatics for healthcare epidemiology. In: Sintchenko V, editor. Infectious disease informatics. New York: Springer; 2010. p. 305–321.
2. Tsang C, Palmer W, Bottle A, Majeed A, Aylin P. A review of patient safety measures based on routinely collected hospital data. Am J Med Qual 2012;27:154–169.
3. Kimball R, Ross M. The data warehouse toolkit: the definitive guide to dimensional modeling. 3rd ed. Indianapolis, IN: Wiley; 2013.
4. García Álvarez L, Aylin P, Tian J, et al. Data linkage between existing healthcare databases to support hospital epidemiology. J Hosp Infect 2011;79:231–235.
5. Finney JM, Walker AS, Peto TE, Wyllie DH. An efficient record linkage scheme using graphical analysis for identifier error detection. BMC Med Inform Decis Mak 2011;11:7.
6. Ferguson AR, Nielson JL, Cragin MH, Bandrowski AE, Martone ME. Big data from small data: data-sharing in the 'long tail' of neuroscience. Nat Neurosci 2014;17:1442–1447.
7. Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inform Sci Syst 2014;2:3.
8. Wilson SJ, Wong D, Pullinger RM, Way R, Clifton DA, Tarassenko L. Analysis of a data-fusion system for continuous vital sign monitoring in an emergency department. Eur J Emerg Med 2014 Jul 9 [Epub ahead of print].
9. McCloskey B, Endericks T, Catchpole M, et al. London 2012 Olympic and Paralympic Games: public health surveillance and epidemiology. Lancet 2014;383(9934):2083–2089.
10. Hay SI, George DB, Moyes CL, Brownstein JS. Big data opportunities for global infectious disease surveillance. PLoS Med 2013;10:e1001413.
11. Trifiro G, Coloma PM, Rijnbeek PR, et al. Combining multiple healthcare databases for postmarketing drug and vaccine safety surveillance: why and how? J Intern Med 2014;275:551–561.
12. Mattmann CA. Computing: a vision for data science. Nature 2013;493(7433):473–475.
13. Freemantle N, Richardson M, Wood J, et al. Weekend hospitalization and additional risk of death: an analysis of inpatient data. J R Soc Med 2012;105:74–84.
14. Shorr AF, Myers DE, Huang DB, Nathanson BH, Emons MF, Kollef MH. A risk score for identifying methicillin-resistant Staphylococcus aureus in patients presenting to the hospital with pneumonia. BMC Infect Dis 2013;13:268.
15. Delgado-Rodríguez M, Llorca J. Bias. J Epidemiol Community Health 2004;58:635–641.
16. Lazer D, Kennedy R, King G, Vespignani A. The parable of Google Flu: traps in big data analysis. Science 2014;343(6176):1203–1205.
17. Doll R, Hill AB. A study of the aetiology of carcinoma of the lung. BMJ 1952;2(4797):1271–1286.
18. Elliott P, Peakman TC; UK Biobank. The UK Biobank sample handling and storage protocol for the collection, processing and archiving of human blood and urine. Int J Epidemiol 2008;37:234–244.
19. Weber GM, Mandl KD, Kohane IS. Finding the missing link for big biomedical data. JAMA 2014;311:2479–2480.
20. Larson EB. Building trust in the power of "big data" research to serve the public good. JAMA 2013;309:2443–2444.
21. Gehrke J. Quo vadis, data privacy? Ann NY Acad Sci 2012;1260:45–54.
22. Rohde H, Qin J, Cui Y, et al. Open-source genomic analysis of Shiga-toxin-producing E. coli O104:H4. N Engl J Med 2011;365:718–724.
23. Schadt EE, Woo S, Hao K. Bayesian method to predict individual SNP genotypes from gene expression data. Nat Genet 2012;44:603–608.
24. Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y. Identifying personal genomes by surname inference. Science 2013;339(6117):321–324.
25. Hawker JI, Smith S, Smith GE, et al. Trends in antibiotic prescribing in primary care for clinical syndromes subject to national recommendations to reduce antibiotic resistance, UK 1995–2011: analysis of a large database of primary care consultations. J Antimicrob Chemother 2014;69:3423–3430.
26. Cooke J, Stephens P, Ashiru-Oredope D, et al. Longitudinal trends and cross-sectional analysis of English national hospital antibacterial use over 5 years (2008–13): working towards hospital prescribing quality measures. J Antimicrob Chemother 2015;70:279–285.
27. Weist K. Surveillance of antimicrobial consumption in Europe. Stockholm: ECDC; 2014.
28. Currie CJ, Berni E, Jenkins-Jones S, et al. Antibiotic treatment failure in four common infections in UK primary care 1991–2012: longitudinal analysis. BMJ 2014;349:g5493.
29. Cooper BS, Kypraios T, Batra R, Wyncoll D, Tosas O, Edgeworth JD. Quantifying type-specific reproduction numbers for nosocomial pathogens: evidence for heightened transmission of an Asian sequence type 239 MRSA clone. PLoS Comput Biol 2012;8:e1002454.
