Quality of Big Data in health care

Sreenivas R. Sukumar
Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA

Ramachandran Natarajan
Department of Decision Sciences and Management, Tennessee Technological University, Cookeville, Tennessee, USA

Regina K. Ferrell
Electrical and Electronics Systems Research Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA

Received 21 July 2014; revised 14 October 2014, 1 March 2015 and 20 April 2015; accepted 8 May 2015

Abstract

Purpose – The current trend in Big Data analytics, and in health information technology in particular, is toward building sophisticated models, methods and tools for business, operational and clinical intelligence. However, the critical issue of the data quality required for these models is not getting the attention it deserves. The purpose of this paper is to highlight the issues of data quality in the context of Big Data health care analytics.

Design/methodology/approach – The insights presented in this paper are the results of analytics work done in different organizations on a variety of health data sets. The data sets include Medicare and Medicaid claims, provider enrollment data sets from both public and private sources, and electronic health records from regional health centers accessed through partnerships with health care claims processing entities under health privacy protected guidelines.

Findings – Assessment of data quality in health care has to consider: first, the entire lifecycle of health data; second, problems arising from errors and inaccuracies in the data itself; third, the source(s) and the pedigree of the data; and fourth, how the underlying purpose of data collection impacts the analytic processing and the knowledge expected to be derived. Automation in the form of data handling, storage, entry and processing technologies is to be viewed as a double-edged sword: at one level automation can be a good solution, while at another level it can create a different set of data quality issues. Implementation of health care analytics with Big Data is enabled by a road map that addresses the organizational and technological aspects of data quality assurance.

Practical implications – The value derived from the use of analytics should be the primary determinant of data quality. Based on this premise, health care enterprises embracing Big Data should have a road map for a systematic approach to data quality. Health care data quality problems can be so specific that organizations might have to build their own custom software or data quality rule engines.

Originality/value – Today, data quality issues are diagnosed and addressed in a piecemeal fashion. The authors recommend a data lifecycle approach and provide a road map that is more appropriate to the dimensions of Big Data and fits the different stages in the analytical workflow.

Keywords Data handling, Big Data analytics, Data quality, Health care analytics, Health information technology, Health care claims

Paper type Technical paper

© This paper has been co-authored by employees of UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy. Accordingly, the US Government retains and the publisher, by accepting the paper for publication, acknowledges that the US Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

International Journal of Health Care Quality Assurance, Vol. 28 No. 6, 2015, pp. 621-634. Emerald Group Publishing Limited, 0952-6862. DOI 10.1108/IJHCQA-07-2014-0080


Introduction

Today, Big Data's key dimensions of volume, velocity and variety (Laney, 2001) are driving the development of scalable storage infrastructure, algorithms, software tools and newer models for analytics. From enterprise practice, value and veracity are emerging as two additional key dimensions of Big Data (Manyika et al., 2011; Liu, 2014). Value refers to the cost-benefit to the decision maker through the ability to take meaningful action based on insights derived from data. Veracity is defined (Merriam-Webster Dictionary, 2006) as "conformity with truth or fact." In the context of Big Data, this refers to any source that influences accuracy and/or introduces uncertainty into the inference from data, such as inconsistencies, missing data, ambiguities, deception, fraud, duplication, spam and latency. In that sense, data quality is subsumed under the definition of veracity. These five Vs (volume, velocity, variety, veracity and value) of Big Data are all inter-related. The link between data veracity and value is direct and clear, i.e. garbage in, garbage out (GIGO). With the other Vs, the relationships may be subtle and less obvious. For instance, volume can mask poor data quality, velocity can rapidly propagate poor quality and variety can create data-context ambiguities. The challenges that arise from the above Vs are being addressed in different forums and working groups (BIG, 2015; BDSSG, 2015).

Quality and productivity of health care systems are major concerns internationally (Thomson et al., 2013). For instance, the health care sector accounts for a significant share of the US gross domestic product (GDP). In 2011, spending on health care amounted to about 17.9 percent of GDP, making it the largest sector in the US economy (The Economist, 2013). There are tremendous opportunities for Big Data analytics to impact the productivity and quality of the health care sectors in the USA (Lighter and Bradley, 2013; Groves et al., 2013; Murdoch and Detsky, 2013) and in other countries such as the UK, Japan, China, South Korea, Thailand and Malaysia (Royal College of Physicians, 2006; Aljunid et al., 2012). The vision for the use of Big Data in health care is being mapped out (Safran et al., 2007; Europe 2020, 2014). Analytical insights gleaned from meaningful analysis of health care data can potentially change business and clinical models, guide the expected roll-out of value-based purchasing, and realize efficiencies through smarter delivery of care. For example, Big Data analytics in health care enables meaningful responses to queries such as: first, how are costs for various aspects of health care likely to rise in the future? Second, how are certain policy changes impacting cost and behavior? Third, how do health care costs vary geographically? Fourth, can fraudulent claims be detected? Fifth, what treatment options seem most effective for various diseases? Sixth, why do some providers seem to have better health outcomes? Seventh, why do patients choose one provider over another? Eighth, are there early signs of an epidemic?

Researchers, analytics vendors and software developers who create and deploy sophisticated infrastructure and organization-specific intelligence tools for health care decision and policy making assume that data quality is assured by the data-supplying organization. Data quality is often taken for granted.
Unlike sectors such as manufacturing, where market expectations drive the quality of a product, market forces in the Big Data industry have not imposed a similar standard on data quality. This is particularly true for health care data. The major sources of poor data quality are addressed in this paper with real-world examples and illustrations. Figure 1 captures a typical lifecycle of health data and its use in the health care domain. On the left side of Figure 1 is a list of different types of health data: insurance claims, electronic health records (EHR), pharmaceutical events, clinical/diagnostic laboratory data, genetic information, related geospatial statistics and potentially other weakly associated but relevant information.

[Figure 1: The lifecycle of health data and some sources of data quality errors in the lifecycle. The figure traces health data (insurance claims, electronic medical records, clinical sensor data, gene sequences, pharmaceutical and social/behavioral/operations data) through enterprise data systems and research repositories into analytical uses such as claim audits for fraud, waste and abuse, strategic policy, clinical research, operations research, cost forecasting, econometric models, computer-diagnostic models, eligibility and coverage models and epidemiological decisions, which in turn change health care delivery and influence cost and coverage. Data entry errors, data staging errors, relevance and context errors, and errors due to evolving standards are marked at the stages where they enter the lifecycle.]

The parameters for the different Vs of Big Data in health care are as follows. For a population of a million patients, the volume of health care claims data can be on the order of terabytes and, for genomic data, on the order of petabytes. Claims data in structured transactional form can stream in at a rate of about 20 claims per minute (velocity). Unstructured data such as EHR text and clinical images add variety and have even higher volumes and velocities than claims data. Moreover, the variability in volume, velocity and variety of Big Data in health care adds to the complexity and the difficulty of assuring veracity in data analytics.

The lifecycle of health data begins with the integration of the different types of data from different sources into a centralized location. The centralized repository is then leveraged to run business-specific analytical models that recommend actionable insights. An example of a centralized system leading to such insights is the national clinical database in Japan (Murakami et al., 2014). The actionable insight often leads to changes and improved outcomes in the quality and integrity of health care delivery, and in most cases may mean additional data collection, regulatory improvements or a move toward convenience with better standards and codebooks. Quality issues can arise at every stage of data collection, integration, transformation and inference. Also, the chance of data quality issues propagating and quality deteriorating increases within this feedback-controlled health care system. This paper discusses in detail the primary sources of data quality problems illustrated in the figure. In the next section, the sources that contribute to low veracity are discussed along with their consequences in the health data lifecycle. The sources of errors discussed are not statistical errors due to sampling; our focus is on non-sampling sources of errors, which are difficult to quantify in terms of accuracy estimates and +/− type margins of error.

Sources, consequences and remedies
An assessment of data quality in health care has to consider: first, problems arising from errors and inaccuracies in the data itself; second, the source(s) and the pedigree of the data; and third, how the underlying purpose of data collection impacts the analytic processing and the knowledge expected to be derived.


A consideration of errors and inaccuracies in the data would include data entry errors, missing data fields or entire records, and errors arising from transformations in the extracting and transforming process for analytics (Svolba, 2014). An examination of the source(s) of the data can reveal limitations and concerns as to its appropriateness to the type of analysis being performed (e.g. use of financial data for evaluation of treatments), variations due to data merged from two different business models, variations due to entity and identity disambiguation (or a lack thereof) and variations due to constantly evolving business models. Additionally, data veracity issues can arise from attempts to preserve privacy where de-identifying/disguising data is intentional, such as following the guidelines of the Health Insurance Portability and Accountability Act (HIPAA) in the USA. Also, data veracity is a function of how many sources contributed to the data collection process and their similarities and differences. Data sets integrated from multiple sources are often characterized by different levels of data quality. This can result in degradation of the overall quality of the integrated data to the lowest level of data quality of the contributing sources.

Sources that contribute to questionable veracity in health care can be broadly classified into the following five categories. For each category, some common modes of quality corruption that occur in real health care data are described with examples.

Relevance and context
Ensuring that there is no misrepresentation with respect to the context within which insights are being extracted from the data is a major challenge in health care. As illustrated in Figure 1, the majority of the health care data collected is for financial accounting and legal/federal regulation purposes. While the quality of health data sets may exceed expectations for financial purposes, they may not meet the stringent quality requirements of clinical or epidemiological research. The completeness and accuracy of key data fields depend on the purpose of the data collection effort (Snee et al., 2014). A study by the Royal College of Physicians (2006) concluded that hospital and patient episode data did not reliably reflect the working practices of physicians because the data were designed mainly for administrative purposes. Data collected for financial reimbursement alone may not be the best data for inferring clinical diagnosis and care information. Conversely, data collected for a clinical trial may not be indicative of typical costs of treatment options. It is worth noting that data sets from clinical trials may have their own quality issues (Saijo et al., 2006). Although, for the sake of efficiency, analysts would like to leverage available data sets for newer applications, the resources needed to collect new data should be balanced against the risk of existing data being irrelevant or out of context. One solution to the out-of-context problem is to encourage the creators and the users of the data to develop a deep understanding of the phenomenon and the context that give rise to the data, and to document that understanding as meta-data to support future use of the data. In the age of Big Data and massive databases, it has become difficult for an individual analyst to develop, maintain and share such an understanding of data quality across several sources.
However, commercially available knowledge management tools (Copperman et al., 2008; IBM, 2014; Sukumar and Ferrell, 2013) that can archive meta-data in machine readable and human-searchable formats can enable the analysts to overcome this hurdle.
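A lightweight way to capture such context, short of the commercial knowledge management tools cited above, is to store structured meta-data alongside each data set. The following sketch is illustrative only: the field names, data set and quality notes are hypothetical, and a real deployment would align them with the organization's data dictionary.

```python
import json
from datetime import date

# Hypothetical meta-data record describing the context in which a data set was collected.
# Field names are illustrative, not a standard; adapt them to the organization's data dictionary.
claims_metadata = {
    "dataset": "medicare_claims_2013",           # hypothetical data set name
    "collected_for": "financial reimbursement",  # original purpose of collection
    "suitable_for": ["cost forecasting", "fraud audits"],
    "not_validated_for": ["clinical outcomes research"],
    "source_systems": ["payer adjudication system"],
    "codebooks": {"diagnosis": "ICD-9-CM", "procedure": "CPT-4"},
    "known_quality_issues": [
        "provider identifiers not disambiguated",
        "optional fields sparsely populated",
    ],
    "last_profiled": date.today().isoformat(),
}

# Persist the record in a machine-readable, human-searchable form so that future
# analysts can judge relevance and context before reusing the data.
with open("medicare_claims_2013.metadata.json", "w") as f:
    json.dump(claims_metadata, f, indent=2)
```

Even such a minimal record gives a future analyst a fast answer to the out-of-context question: what was this data set collected for, and what has it actually been validated to support?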


Data entry errors (human, human-machine interface, software tools)
Data entry errors are all too common in health care. They are prevalent in diverse health care settings, as illustrated by examples from the USA, Saudi Arabia and Iran (Goldberg et al., 2008; Shelby-James et al., 2007; AlJarallah and AlRowaiss, 2013; Fahimi et al., 2009). There are some key differences between data entry mechanisms in traditional Big Data and Big Data in health care. The process of data creation in industrial and commercial settings outside of health care is highly automated, such as the recording and logging of keystrokes for web site access and financial transactions, the collection of sensor data from automated processes, etc. In the health care world, however, the collection, integration and organization of data are primarily manual. Even if the structure of the health data collected is rather rigorously defined, the mechanism for populating the original data is largely manual data entry. The critical difference is data input by humans, who can both intentionally and unintentionally introduce systematic data errors. Incorrect entry of a name, address or key identification field such as a social security number or insurance identification can lead to ambiguous data records that could be attributed to the wrong person or could lead to multiple records for a single person. When the data provided are biased through omission or inaccurate entries, correlations may be found that are inaccurate, or important relationships could be missed (Adler-Milstein and Jha, 2013). Though data may be input via forms that attempt to limit data errors where possible, by providing drop-down menus wherever applicable, users can still accidentally pick the wrong selection. Forms may have selections that pre-populate parts of the form that are not always accurate, but do not get corrected. Certain data fields may be regularly used and populated by some practitioners or clerks, while the same fields may not be routinely populated by others. Forms may also allow optional fields for entry. In such cases, the completeness aspect of data quality is not well-defined. A possible solution to data entry errors is automation and mistake-proofing. Simpler user interfaces for data entry and domain-specific rule engines for error-checks have reduced data entry errors. Some commercial tools are SAS DataFlux (SAS Institute, 2011), Informatica's Data Quality (Informatica, 2014a), Stanford's Data Wrangler (Kandel et al., 2011) and Trifacta's Data Transformation Platform (Trifacta, 2014). Open source data quality check tools are also available (Barateiro and Galhardas, 2005). On the other hand, it has been observed that automation can create data redundancies and data entry errors as well (Shelby-James et al., 2007). While making entries into EHR, physicians often use templates or copy-paste commands to generate text that will comply with guidelines and regulations of insurers such as Medicare. This practice can obscure variations across patient records that could be valuable for clinical discoveries. Data generated by automated software with auto-fill options, speech-to-text converters and optical-character-recognition devices that digitize health data – all of which are common practices – can produce systematic and random errors that vary from person to person and tool to tool, and are hard to quantify and avoid.
Furthermore, there can be ambiguity in the semantic description of diagnoses and procedures across physicians, and in how billing agents encode those semantic variations in claim forms. When reimbursements are at stake, there could even be an incentive to deliberately distort the data. For instance, a physician may note a treatment for herpes and the billing agent is likely to code it as treatment for Herpes-2 rather than Herpes-1, which may not be reimbursed.
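To make the idea of a domain-specific rule engine concrete, here is a minimal sketch in Python. The field names, code sets and rules are hypothetical and chosen purely for illustration; an actual engine would derive its rules from the organization's data dictionary and coding standards.

```python
import re

# Each rule is a (description, predicate) pair applied to one claim record (a dict).
# These example rules are illustrative, not an exhaustive or standard rule set.
RULES = [
    ("NPI must be 10 digits",
     lambda r: re.fullmatch(r"\d{10}", r.get("provider_npi", "")) is not None),
    ("Gender code must come from the agreed codebook",
     lambda r: r.get("gender") in {"M", "F", "U"}),
    ("Date of service must not precede date of birth",
     lambda r: r.get("date_of_service", "") >= r.get("date_of_birth", "")),
    ("Billed amount must be a positive number",
     lambda r: isinstance(r.get("billed_amount"), (int, float)) and r["billed_amount"] > 0),
]

def check_record(record: dict) -> list:
    """Return the descriptions of all rules that the record violates."""
    return [desc for desc, predicate in RULES if not predicate(record)]

# Example: a record with a malformed NPI and an impossible date of service.
claim = {"provider_npi": "12345", "gender": "M",
         "date_of_birth": "1980-05-02", "date_of_service": "1979-01-01",
         "billed_amount": 125.0}
print(check_record(claim))
```

Running such checks at the point of entry, or immediately afterwards, catches malformed identifiers and impossible dates before they propagate downstream, which is the mistake-proofing intent described above.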


Diversity and evolving standards
Health care data often consist of codes from several referential codebooks: for example, race and gender codes in beneficiary data sets; medical practice taxonomy and specialty codes in provider data sets; diagnosis and procedure codes (Current Procedural Terminology (CPT), International Classification of Diseases (ICD)) in claim data sets; and National Drug Codes on prescription drug events (World Health Organization, 2014; Food and Drug Administration, 2014). The CPT is a uniform coding system consisting of descriptive terms and identifying codes that are used primarily to identify medical services and procedures furnished by physicians and other health care professionals (Centers for Medicare and Medicaid Services, 2014). These health care professionals use the CPT to identify services and procedures for which they bill public or private health insurance programs. While some of these codebooks may be standardized, codebooks can also be specific to a data system, and codebooks evolve over time. Medical data sets may use several standards simultaneously – some 50 years old or older. Suppose a decision maker poses the question: what is the distribution by race of patients undergoing heart surgery? The corruption of the data because of the variety in codebooks and standards makes it impossible to produce a reliable answer. To illustrate, consider the following complicating issues for such an analysis: first, Hospital A uses a codebook with nine race codes, while Hospital B uses a codebook with only five race codes; second, Hospital C could be using ICD-9 while some clinics have transitioned to ICD-10; third, old software systems still use CPT codes for procedure claims although most insurers prefer the ICD system. In addition, some legacy systems may not have kept up with evolving standards, and even newer systems may not be sufficiently flexible to incorporate evolving standards. A solution, albeit expensive, is to use commercially available data products and services that ensure compatibility and equivalence by mapping diverse codes to the extent possible (Find-A-Code, 2014). The ideal solution would be a capability to look up a date-indexed centralized repository for every codebook in the health care universe.
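To illustrate one possible approach, the sketch below maps two hypothetical hospital-specific race codebooks onto a common category set before aggregation. The codes and categories are invented for illustration and do not reproduce any official standard; a real implementation would be driven by a date-indexed, centrally maintained codebook repository.

```python
# Hypothetical local codebooks: Hospital A uses nine race codes, Hospital B only five.
HOSPITAL_A_RACE = {"01": "White", "02": "Black", "03": "Asian", "04": "Pacific Islander",
                   "05": "American Indian", "06": "Multiple", "07": "Other",
                   "08": "Declined", "09": "Unknown"}
HOSPITAL_B_RACE = {"W": "White", "B": "Black", "A": "Asian", "O": "Other", "U": "Unknown"}

# Common analysis categories chosen for this hypothetical study.
COMMON = {"White", "Black", "Asian", "Other", "Unknown"}

def to_common(local_label: str) -> str:
    """Collapse a local race label into the common category set."""
    if local_label in COMMON:
        return local_label
    if local_label in {"Declined", "Unknown"}:
        return "Unknown"
    # Pacific Islander, American Indian, Multiple, etc. lose their detail here.
    return "Other"

def normalize(record: dict, source: str) -> str:
    """Translate a record's local race code into a common category, by source hospital."""
    codebook = HOSPITAL_A_RACE if source == "A" else HOSPITAL_B_RACE
    return to_common(codebook.get(record["race_code"], "Unknown"))

print(normalize({"race_code": "04"}, source="A"))   # -> "Other"
print(normalize({"race_code": "U"}, source="B"))    # -> "Unknown"
```

The sketch also makes the cost visible: collapsing nine codes into five discards detail, which is precisely why such mappings should be documented, versioned and tied to the codebook edition in force at the time of collection.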
Data staging errors
Serious quality errors can occur in the pre-processing and staging of data for analysis. Usually, data staging can involve data migration, integration, machine-to-machine translation and database-to-database conversion. As data are prepared for analytics, they are cleaned and transformed. Common cleaning and transformation operations include removal of trailing and leading whitespace, standardization of the number of leading zeroes for some identifying numbers, address standardization routines, and ensuring that data meet the constraints for a given field. Decisions made during the extract, transform, load (ETL) process – such as using metric units vs English units, or allowing a key cost field to be left blank (which could be encoded in a legacy system as "88888" or "99999") – can have downstream ramifications in the analytics workflow. When data have to go through multiple ETL processes in the business workflow, relationships between entities (e.g. patient-claim, patient-provider and provider-claim) can be lost or corrupted. The ETL process during data integration from multiple sources can also propagate errors. As electronic submissions are accepted and merged from different organizations, or even from different sources in the same organization, certain data cleaning and transformation operations are initiated to prepare data for storage and analysis. For example, if a field is supposed to hold a date, checks are made that the data supplied are of the proper size, value and format to translate to a valid date that conforms to the constraints in the new database.
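As one illustration of such a field-level check, the following sketch validates a date field and recognizes legacy sentinel codes. The sentinel values, the accepted date format and the plausible range are assumptions made for illustration.

```python
from datetime import date, datetime

# Sentinel strings some legacy systems use for "no value"; assumed for illustration.
LEGACY_BLANK_CODES = {"88888", "99999", ""}

def clean_date(raw: str, earliest: date = date(1900, 1, 1),
               latest: date = date(2025, 12, 31)):
    """Return (value, issue): value is a date or None; issue records why it was rejected."""
    raw = raw.strip()
    if raw in LEGACY_BLANK_CODES:
        return None, "legacy blank code"
    try:
        value = datetime.strptime(raw, "%Y-%m-%d").date()
    except ValueError:
        return None, "not a parsable date"
    if not (earliest <= value <= latest):
        return None, "outside plausible range"   # e.g. 1802-06-30 or 2099-12-31
    return value, None

for raw in ["2014-03-07", "1802-06-30", "99999", "07/03/2014"]:
    print(raw, "->", clean_date(raw))
```

The design point is that a rejected value is returned with an explicit reason rather than silently blanked or dropped, so downstream analysts can see how many records in a given date range were affected by the ETL rules.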


Unrealistic dates can be observed, e.g., June 30, 1802 as a date of birth; a string of characters that is not a valid date; or an expiration date such as December 31, 2099 used to represent an open-ended time in the future. If the data do not translate to a valid date, the ETL process may enforce a rule that changes the field value to a pre-determined value or leaves it blank. In some situations, where the date is a key field, the entire record may be rejected and flagged. If an analyst is looking for data that occurred within a date range and those data are not available for a large number of records (lost during the ETL process), this could have a significant impact on the analysis and conclusions. A situation that could occur during the integration of claims data from two hospitals is the following. Two hospitals handling different payers (e.g. Medicare, Medicaid and BlueCross) that use the same standardized structure for filing claim forms may have different adjudication processes. Although the data and their organization may look similar post-integration, the system has to account for the fact that the adjudication processes are not. Otherwise, the system cannot ascertain whether there are duplicate claims, which has financial implications. A procedure to avoid error propagation and to maintain and improve data quality throughout the analytical workflow is to check the data for anomalies and outliers by computing summary statistics, such as the maximum and minimum, at every ETL step. Data quality inspection can be done using logical constraints derived from interaction with subject matter experts. Trained quality analysts can then use interactive graphics and exploratory data analysis tools such as stem-and-leaf and box plots to look at the data in different ways. They can then flag outliers for further investigation and action. Effective inspection of data using the above tools requires skill in knowing what to look for and how to recognize the anomalies. These skills relate back to familiarity with and understanding of the data generation process (Snee and Hoerl, 2012). In the Big Data world, Apache's Hadoop (Holmes, 2012) and SAP's HANA (Färber et al., 2012) are relatively inexpensive software tools for the ETL processes related to data quality.

Entity resolution
Health care data involve a complex web of entities such as providers, patients, payers and regulators. There is a critical need to know and track every entity within the system with a high degree of confidence. Often referred to as the identity disambiguation problem, this is one of the major, if not the toughest, data quality challenges in health care. Accurate association of the health care episodes of a patient who may be visiting multiple health care providers is absolutely essential to documenting and retrieving a complete history of health-related events. For example, if a patient happens to have two near-identical records such as John Doe and John H. Doe in the system, and different care episodes therefore get assigned to each record, neither of these identities will provide a complete record of the patient's health history. This can have serious consequences from a care perspective. Many medical errors and lawsuits center around complications arising from incomplete patient history (AHRQ, 2003; Becher and Chassin, 2001; Gallegos, 2011, 2013; Texas Medical Liability Trust, 2009).
Another consequence of low data veracity with respect to identity disambiguation is that fraudulent providers can hide within the system using multiple identities. Fraud detection software will not be able to find such providers because the suspicious activity can be masked as multiple instances of normal activity, while in reality it is the handiwork of one fraudulent individual.


Entity resolution products such as IBM's Initiate (IBM, 2011) and Informatica's Master Data Management Platform (Informatica, 2014b) can identify records from different sources representing the same real-world entity. These commercially available packages perform a probabilistic match of an entity's key data (e.g. date of birth, social security number, address, phone number, etc.) to discover rules that define when two entity records can be linked or merged with confidence. Maintaining an active master data management solution to track patients, providers and changing health insurance coverage can resolve entity resolution ambiguities.
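The commercial tools cited above are proprietary, but the underlying idea can be sketched as a weighted comparison of key fields. The weights, the similarity measure and the decision threshold below are arbitrary illustrations, not values taken from any product.

```python
from difflib import SequenceMatcher

# Field weights and the decision threshold are illustrative assumptions.
WEIGHTS = {"ssn": 0.5, "date_of_birth": 0.25, "name": 0.15, "address": 0.10}
MATCH_THRESHOLD = 0.85

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; real tools use phonetic and edit-distance models."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec1: dict, rec2: dict) -> float:
    """Weighted agreement of key identifying fields between two records."""
    return sum(w * similarity(rec1.get(f, ""), rec2.get(f, "")) for f, w in WEIGHTS.items())

john_1 = {"name": "John Doe", "date_of_birth": "1969-08-17",
          "ssn": "123-45-6789", "address": "12 Elm St, Knoxville TN"}
john_2 = {"name": "John H. Doe", "date_of_birth": "1969-08-17",
          "ssn": "123-45-6789", "address": "12 Elm Street, Knoxville TN"}

score = match_score(john_1, john_2)
print(score, "-> same patient" if score >= MATCH_THRESHOLD else "-> keep separate")
```

In this hypothetical case the two John Doe records score well above the threshold and would be linked, which is exactly the consolidation needed to recover a complete care history.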


Road map for health data quality assurance
Thus far, the different sources of errors, along with their consequences in the health data lifecycle, have been discussed. Although the primary sources of data quality issues are data collection and integration interfaces, the solution for better quality assurance in health care analytics relies on organizational support for technological innovations and quality management (Arts et al., 2002). Toward that end, we provide a road map in Figure 2 as a systematic approach to data quality assurance for health care enterprises willing to embrace Big Data technologies – where the business impact and value derived from the use of analytics are going to be the primary determinants of investment in data quality (Loshin, 2010).

Organizational guidelines for investing in infrastructure, hardware, software and personnel for health data quality assurance
Modernize legacy systems: legacy systems are commonplace in health care and are a major source of quality issues; their rigid data storage formats make it difficult to apply data cleaning solutions. Legacy systems and software were built with the assumption that the data presented to these systems are going to be clean and perfect and that the schemas representing the data will rarely change over time. Modern systems that are more flexible and quality-control friendly face resistance to deployment into operational environments.

[Figure 2: Road map for incorporating data quality for health care organizations embracing Big Data and analytics. The road map spans organizational and technological actions across four stages – enterprise-wide data governance; assessing relevance and state of quality; quality-aware data models and process; and evaluate, document and disseminate lessons learned – with elements including: capture business model knowledge; form data governance committees and standards; modernize legacy systems if possible; decide on a centralized or federated quality control strategy; establish code/field translation principles; gather requirements and goals of the analytical need; identify and characterize data resources; evaluate current and future storage, processing and analysis requirements and potential tools; identify and select health coding and demographic standards for data storage; develop schemas to manage de-duplication and partial or erroneous records; deploy master data management/entity resolution tools; allocate computational resources for quality-related statistical analysis; design methodologies for tracking domain-specific quality-related interventions on data and archive meta-data of transformations along with quality issues faced and resolved; evaluate the adaptability of the solution to varying data availability, varying refresh rates or partial system failures; use social collaborative tools to record and discuss quality problems and solutions; and disseminate issues and lessons learned to current and past users of the data.]

The fear of disrupting existing critical functions and compliance with health care regulations (e.g. the Health Insurance Portability and Accountability Act in the USA and the Personal Information Protection Act in Japan) is a major hurdle to modernization. However, system modernization is well worth it in the Big Data era of data-driven decision making. Our recommendation is that when technological innovations are approved, become necessary or are mandated, a quality-conscious modernization plan should be adopted. Some goals of such modernization are ensuring conformity to data standards for codebooks and a commitment to software applications for the transformation and migration of data from legacy systems. This recommendation does not obviate the need to consider the special requirements of health data such as confidentiality and privacy. The system design has to be flexible enough to accommodate the requirements discussed in Lee and Gostin (2009) and Hoffman and Podgurski (2012).

Enforce an enterprise-wide quality management strategy: depending on the data volume and the distributed nature of operations, health care organizations can choose between centralized and federated approaches to data quality. The centralized approach works best in situations where organizational Big Data only produces "Bigger Data" over time, while the federated approach works best when organizations are standards-driven (e.g. single-vendor software and conformity in process using standards-based management). In the practitioner community, there is a debate on strategies of quality control – one camp arguing that for large-volume data where distributed analytics may not work, a centralized approach would be appropriate, and another camp willing to explore well-managed federated quality control at the source. In the health care domain, where analysis is often of value on consolidated/integrated data sets, implementing quality control in a centralized fashion may be more advantageous from the standpoint of assuring consistency. A detailed concept-of-operations for deciding between different models to implement quality control is described in the Data Management International (DAMA) guide (DAMA-DMBOK, 2009). DAMA is an independent association of information handling professionals.

Implement data governance: a good first step toward governance of enterprise health data is setting up committees and data-quality-standard management systems that are held responsible for administering and managing quality expectations. The data governance committees evaluate, develop and adopt standards that will be used enterprise-wide. The committees will also advise on new analytical workflows with regard to the relevance, context and applicability of the data elements specific to the analytical need. They also investigate different sources of errors within and outside the enterprise, study best practices in the IT industry and introduce tools, methods and processes to address data quality issues. They will function like the institutional review boards that enforce HIPAA privacy considerations in the USA. By carefully considering data quality issues at every stage of the analytical workflow (sample extraction, data integration, query construction, etc.), and by documenting and assessing quality risks, the governance committee provides feedback that raises the quality of the stored data while improving the integrity of analytic results.
Technological guidelines for implementing quality-tracking data models and processes for quality-aware analytics
Assessing relevance and state of quality: as a best practice, every time a new data store is created, a new data source is added or a new data set is integrated into an existing data warehouse, the analytical question driving the data integration should be discussed in the context of the state of data quality.

This study of what is to be accomplished includes the identification and quantification of the resources available to accomplish the analysis. Tasks such as describing business model knowledge, gathering requirements and identifying the types of questions expected to be answered based on the data are the first steps in this stage. An understanding of the data sets to be utilized, their fitness for the desired analysis and the risk of data quality issues in the data elements should be assessed. This understanding can be aided by a statistical evaluation of a sample of the data sets and their key structural elements. In addition, an evaluation of the volume and update rate of data from each source, and their impact on analytical validity, should be made. A set of proven quality analysis tools and/or data cleaning algorithms can be utilized to assure the continued maintenance of high-quality data. Organizations should allow and budget for human analysts in the loop as required.

Designing quality-aware data models and processes: quality assurance committees should enforce standardized data structures and the development or selection of standard data dictionary definitions and data specifications. The knowledge of the data elements and the selection of core demographic and health coding standards must be evaluated for suitability and conformance to the analytic requirements. Human analysts can intervene and identify methods to allow flexibility in selection or choose domain-specific data transformations. To the extent possible, such domain-specific interventions should be tracked in the data model for future reference. For example, consider an analytical need that utilizes a number of legacy resources. There are a number of important processes to put in place: establishment of field and code translations from the various sources to those specified in the data model and data dictionary for the composite data; master data management and/or entity resolution for the various health care entities and patients or beneficiaries; and de-duplication of records, managing of updated records and management of partial records or invalid field values. A process for evaluating sampled data sets and field values produced by the transformation processes should be developed to check for changes in the data that may indicate a transformation bug or an unexpected or undocumented change in process.

Evaluate, document and disseminate data quality lessons learned: a challenging aspect of quality control is when technology is the source of problems. Data quality issues can sneak in from a variety of possible sources such as non-compliance to standards between hardware platforms, differing uptime performance, time-constrained quick-fixes by database administrators, varying data availability, varying data refresh rates, system failures, etc. Organizations should implement policies for statistical checks on the data to evaluate irregularities in the process by allocating computational resources for quality-related statistical analysis. Once these technological issues are identified, they need to be documented and made easily available for future reference. The quality control document maintains an archive of the different quality issues resolved over time, and guarantees that issues identified in the document have been addressed in the data system and that records carry evidence of the quality checks performed on them. Collaborative documentation tools (e.g. wikis and forums) for analysts and system administrators will foster ease of use, adoption and engagement toward quality assurance and the dissemination of quality-related lessons to current, future and past users of the data.

The road map discussed above can serve as a template for designing program initiatives in organizations to institute effective data quality standards, policies and investments in the context of Big Data analytics. This can lead to better decisions that translate into improved health care cost, quality and delivery.
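As a concrete, minimal illustration of the statistical checks recommended above, the following sketch profiles a newly integrated table and reports null rates and simple range statistics per column. The file, column names and the flagging threshold are hypothetical.

```python
import csv
from statistics import mean

def profile(path: str, numeric_columns: tuple = ("billed_amount",)):
    """Print per-column null rates and basic statistics for a delimited claims extract."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        print("empty extract")
        return
    for col in rows[0]:
        values = [r[col] for r in rows]
        null_rate = sum(1 for v in values if v in ("", None)) / len(values)
        line = f"{col}: {null_rate:.1%} missing"
        if col in numeric_columns:
            nums = [float(v) for v in values if v not in ("", None)]
            if nums:
                line += f", min={min(nums)}, max={max(nums)}, mean={mean(nums):.2f}"
        print(line)
        # Illustrative threshold for escalating a field to the governance committee.
        if null_rate > 0.20:
            print(f"  -> flag {col}: missingness exceeds 20%")

# profile("claims_extract.csv")   # hypothetical extract produced by the ETL process
```

Running such a profile after every ETL step, and archiving its output alongside the data, gives the governance committee a record of when and where a quality irregularity first appeared.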


Conclusions and recommendations
The paper discussed major sources of data quality errors in health care, the potential consequences of quality issues for the insights and interpretations drawn from the analytics, and best practices and tools that address these issues. These discussions lead to the following conclusions and recommendations. It is to be noted that the sources of errors discussed in this paper are not exhaustive; other potential sources of data quality problems that may be more technical and domain specific were considered beyond the scope of this paper:

• Today, data quality issues are diagnosed and addressed in a piecemeal fashion. A data lifecycle approach, which is more appropriate to the dimensions of Big Data and fits the different stages in the analytical workflow, is recommended.

• Commercial tools for data quality assessment and management may be expensive, but open source alternatives are available. Open source alternatives may take longer to implement and need dedicated resource support for deployment.

• Automation in the form of data handling, storage, entry and processing technologies is to be viewed as a double-edged sword. At one level, automation can be a good solution, while at another level it can create a different set of data quality issues.

• Health care data quality problems can be so specific that organizations might have to build their own custom software or data quality rule engines. Commercial software tools will still need use-case-specific modifications.

• Analytical insights in health care should always be probed for data quality problems. This is emphasized because most analytical tools assume that the data are of very high quality. State-of-the-art analytical algorithms are not robust to poor data quality.

This paper has addressed several problem areas that a data analyst has to be wary of. It has identified opportunities for research in data quality assessment. It provides a road map to incorporate data quality principles into the initial system development. This paper has not addressed the implementation of the proposed road map. Future research will investigate and identify organizational and technological factors that are critical to achieving the specified data quality.

References

Adler-Milstein, J. and Jha, A.K. (2013), "Healthcare's 'Big Data' challenge", The American Journal of Managed Care, Vol. 19 No. 7, pp. 537-538.

AHRQ (2003), AHRQ's Patient Safety Initiative: Building Foundations, Reducing Risk, December, Agency for Healthcare Research and Quality, Rockville, MD, available at: http://archive.ahrq.gov/research/findings/final-reports/pscongrpt/psini2.html (accessed 19 April 2015).

AlJarallah, J.S. and AlRowaiss, N. (2013), "The pattern of medical errors and litigation against doctors in Saudi Arabia", Journal of Family and Community Medicine, Vol. 20 No. 2, pp. 98-105.

Aljunid, S.M., Srithamrongsawat, S., Chen, W., Bae, S.J., Pwu, R.F., Ikeda, S. and Xu, L. (2012), "Health-care data collecting, sharing, and using in Thailand, China Mainland, South Korea, Taiwan, Japan, and Malaysia", Value in Health, Vol. 15 No. 1, pp. S132-S138.


Arts, D.G., De Keizer, N.F. and Scheffer, G.J. (2002), "Defining and improving data quality in medical registries: a literature review, case study, and generic framework", Journal of the American Medical Informatics Association, Vol. 9 No. 6, pp. 600-611.

Barateiro, J. and Galhardas, H. (2005), "A survey of data quality tools", Datenbank Spektrum, Vol. 14, pp. 15-21.

BDSSG (2015), "Big Data Senior Steering Group (BDSSG)", available at: www.nitrd.gov/nitrdgroups/index.php?title=Big_Data_(BD_SSG)#title (accessed 19 April 2015).

Becher, C. and Chassin, M. (2001), "Improving quality, minimizing error: making it happen", Health Affairs, Vol. 20 No. 3, pp. 68-81.

BIG (2015), "Big Data Public Private Forum", available at: www.big-project.eu/ (accessed 19 April 2015).

Centers for Medicare and Medicaid Services (2014), "Current Procedural Terminology (CPT)", available at: www.cms.gov/Medicare/Coding/MedHCPCSGenInfo/index.html (accessed 12 October 2014).

Copperman, M., Angel, M., Rudy, J.H., Huffman, S.B., Kay, D.B. and Fratkina, R. (2008), "System and method for implementing a knowledge management system", US Patent No. 7,401,087 B2, July 15, 2008.

DAMA-DMBOK (2009), The DAMA Guide to Data Management Body of Knowledge, 1st ed., Technics Publication LLC, Denville, NJ.

Europe 2020 (2014), "Discussion Big Data and healthcare: 'a new knowledge era in the world of healthcare'", available at: https://ec.europa.eu/digital-agenda/en/news/discussion-big-data-and-healthcare-new-knowledge-era-world-healthcare (accessed 19 April 2015).

Fahimi, F., Abbasi, N.M., Abrishami, R., Sistanizad, M., Mazidi, T., Faghihi, T., Soltani, T. and Baniasadi, S. (2009), "Transcription errors observed in a teaching hospital", Archives of Iranian Medicine, Vol. 12 No. 2, pp. 173-175.

Färber, F., Cha, S.K., Primsch, J., Bornhövd, C., Sigg, S. and Lehner, W. (2012), "SAP HANA database: data management for modern business applications", ACM SIGMOD Record, Vol. 40 No. 4, pp. 45-51.

Find-A-Code (2014), "Find-A-Code", available at: www.findacode.com/search/search.php (accessed 21 July 2014).

Food and Drug Administration (2014), "National Drug Code (NDC)", available at: www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm (accessed 12 October 2014).

Gallegos, A. (2011), "Communication key to reducing liability claims in patient handoffs", American Medical News, available at: www.amednews.com/article/20110620/profession/306209947/5/ (accessed 19 April 2015).

Gallegos, A. (2013), "Medical charting errors can drive patient liability suits", American Medical News, available at: www.amednews.com/article/20130325/profession/130329979/5/ (accessed 19 April 2015).

Goldberg, S.I., Niemierko, A. and Turchin, A. (2008), "Analysis of data errors in clinical research databases", AMIA Annual Symposium Proceedings, Vol. 2008, pp. 242-246.

Groves, P., Kayyali, B., Van Kuiken, S. and Knott, D. (2013), "The 'Big Data' revolution in health care: accelerating value and innovation", McKinsey and Company, April, available at: www.mckinsey.com/insights/health_systems_and_services/the_big-data_revolution_in_us_health_care (accessed 21 July 2014).

Hoffman, S. and Podgurski, A. (2012), "Balancing privacy, autonomy, and scientific needs in electronic health records research", Case Legal Studies Research Paper No. 2011-22, 65 Southern Methodist University Law Review 85, available at SSRN: http://ssrn.com/abstract=1923187 (accessed 11 May 2015).


Holmes, A. (2012), Hadoop in Practice, Manning Publications Company, Shelter Island, NY.

IBM (2011), "Initiate Work Bench User's Guide", available at: http://pic.dhe.ibm.com/infocenter/initiate/v9r5/topic/com.ibm.initiatepdfs.doc/topics/i46wecug.pdf (accessed 21 July 2014).

IBM (2014), "Enterprise content management", available at: www-03.ibm.com/software/products/en/category/enterprise-content-management (accessed 21 July 2014).

Informatica (2014a), "Data quality", available at: www.informatica.com/us/products/data-quality/ (accessed 21 July 2014).

Informatica (2014b), "Master data management", available at: www.informatica.com/us/products/master-data-management/mdm/ (accessed 21 July 2014).

Kandel, S., Paepcke, A., Hellerstein, J. and Heer, J. (2011), "Wrangler: interactive visual specification of data transformation scripts", The ACM Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, pp. 3363-3372.

Laney, D. (2001), "3D data management: controlling data volume, velocity and variety", Application Delivery Strategies, META Group, February 6, available at: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf (accessed 11 May 2015).

Lee, L.M. and Gostin, L.O. (2009), "Ethical collection, storage, and use of public health data: a proposal for a national privacy protection", Journal of the American Medical Association, Vol. 302 No. 1, pp. 82-84, available at: http://jama.jamanetwork.com/article.aspx?articleid=184159 (accessed 19 April 2015).

Lighter, D. and Bradley, R.V. (2013), "The future of analytics in healthcare", presentation at the Decision Sciences Institute 2013 Annual Conference, 17 November, Baltimore, MD.

Liu, S. (2014), "Breaking down the barriers", Quality Progress, Vol. 47 No. 1, pp. 16-22.

Loshin, D. (2010), "Data quality fundamentals", available at: www.dama-ny.com/images/meeting/041510/dqprogram.pdf (accessed 19 April 2015).

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, R. and Byers, A.H. (2011), "Big Data: the next frontier for innovation, competition, and productivity", McKinsey Global Institute Quarterly, May, available at: www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation (accessed 21 July 2014).

Merriam-Webster Dictionary (2006), The Merriam-Webster Dictionary, Merriam-Webster Incorporated, Springfield, MA.

Murakami, A., Hirata, Y., Motomura, N., Miyata, H., Iwanaka, T. and Takamoto, S. (2014), "The National Clinical Database as an initiative for quality improvement in Japan", The Korean Journal of Thoracic and Cardiovascular Surgery, Vol. 47 No. 5, pp. 437-443.

Murdoch, T.B. and Detsky, A.S. (2013), "The inevitable application of Big Data to health care", Journal of the American Medical Association, Vol. 309 No. 13, pp. 1351-1352.

Royal College of Physicians (2006), "Engaging clinicians in improving data quality in the NHS", available at: www.rcplondon.ac.uk/resources/engaging-clinicians-improving-data-quality-nhs-ilab-project-summary (accessed 19 April 2015).

Safran, C., Bloomrosen, M., Hammond, W.E., Labkoff, S., Merkel-Fox, S., Tang, P.C. and Detmer, D.E. (2007), "Toward a national framework for the secondary use of health data: an American Medical Informatics Association White Paper", Journal of the American Medical Informatics Association, Vol. 14 No. 1, pp. 1-9, available at: http://dx.doi.org/10.1197/jamia.M2273 (accessed 19 April 2015).

Saijo, H., Kasai, H., Takahashi, H., Harigai, M. and Takase, K. (2006), "Fundamental data quality assessments of clinical trials in Japan", Journal of Medical and Dental Sciences, Vol. 53 No. 1, pp. 17-25.


SAS Institute (2011), "Data flux", SAS Data Integration Studio 4.3: User's Guide, pp. 521-541, SAS Institute Inc., Cary, NC, available at: http://support.sas.com/documentation/cdl/en/etlug/63360/PDF/default/etlug.pdf (accessed 21 July 2014).

Shelby-James, T.M., Abernethy, A.P., McAlindon, A. and Currow, D.C. (2007), "Handheld computers for data entry: high tech has its problems too", Trials, Vol. 8 No. 5, pp. 1-2.

Snee, R.D. and Hoerl, R.W. (2012), "Inquiry on pedigree", Quality Progress, Vol. 45 No. 12, pp. 66-68.

Snee, R.D., DeVeaux, R.D. and Hoerl, R.W. (2014), "Follow the fundamentals", Quality Progress, Vol. 47 No. 1, pp. 24-28.

Sukumar, S.R. and Ferrell, R.K. (2013), "'Big Data collaboration': exploring, recording and sharing enterprise knowledge", Information Services and Use, Vol. 33 No. 3, pp. 257-270.

Svolba, G. (2014), "Missing values", Analytics-Magazine.Org, Vol. 6 No. 1, pp. 58-65.

Texas Medical Liability Trust (2009), "Failure to refer when appropriate, failure to track referrals, and failure to communicate with referring physician", 10 Things That Get Physicians Sued, Austin, TX, available at: http://impertinentremarks.com/wp-content/uploads/2012/10/Ten-things.pdf (accessed 19 April 2015).

The Economist (2013), "The health paradox", The Economist, May 11, p. 28, available at: www.economist.com/printedition/2013-05-11 (accessed 19 April 2015).

Thomson, S., Osborn, R. and Jun, M. (Eds) (2013), International Profiles of Healthcare Systems, Report No. 1717, The Commonwealth Fund Publisher, New York, NY.

Trifacta (2014), "Data transformation platform", available at: www.trifacta.com/ (accessed 21 July 2014).

World Health Organization (2014), "International Classification of Diseases (ICD)", available at: www.who.int/classifications/icd/en/ (accessed 12 October 2014).

Corresponding author
Dr Ramachandran Natarajan can be contacted at: [email protected]
