Normalization of Phenotypic Data from a Clinical Data Warehouse: Case Study of Heterogeneous Blood Type Data with Surprising Results.


Stud Health Technol Inform. Author manuscript; available in PMC 2017 July 10. Published in final edited form as: Stud Health Technol Inform. 2015 ; 216: 559–563.



James J. Cimino Laboratory for Informatics Development, the Clinical Center of the US National Institutes of Health, Bethesda, Maryland, USA

Abstract

Clinical data warehouses often contain analogous data from disparate sources, resulting in heterogeneous formats and semantics. We have developed an approach that attempts to represent such phenotypic data in its most atomic form to facilitate aggregation. We illustrate this approach with human blood antigen typing (ABO-Rh) data drawn from the National Institutes of Health’s Biomedical Translational Research Information System (BTRIS). In applying the method to actual patient data, we discovered a 2% incidence of changed blood types. We believe our approach can be applied to any institution’s data to obtain comparable patient phenotypes. The actual discrepant blood type data will form the basis for a future study of the reasons for blood typing variation.


Keywords

Clinical Data Repositories; Phenotype Detections; Blood Typing

Introduction


Clinical data warehouses are becoming a common tool for providing access to electronic health record data for various forms of reuse.[1] However, the reuse of historical data can be problematic, especially when data are pooled from multiple sources for which minimal documentation and metadata exist.[2] A challenge for clinical research informatics is to provide users with ways to reconcile such heterogeneous data.[2] For example, if one source codes gender as “male or female” and another source codes gender as “male, female, or other”, summarizing gender data across sources becomes problematic. While normalization procedures are standard for genetic sequence data,[3] they have been less well defined for phenotypic data, typically involving ad hoc mappings between patient characteristics.[4] The purpose of this paper is to present a method for normalization of heterogeneous data by reducing them to specific, well-characterized phenotypic traits. We take as our example the common ABO and Rh blood typing that is typically performed in blood bank laboratories.

Address for correspondence: James J. Cimino, M.D., Director, Informatics Institute, University of Alabama at Birmingham School of Medicine, 1720 2nd Ave South, Birmingham, AL 35294-0113 USA, [email protected].


Background

The Biomedical Translational Research Information System


The Clinical Center of the US National Institutes of Health (NIH) is a hospital in Bethesda, Maryland that has served as a site for clinical research since 1953. Over the years, data from clinical studies at NIH have been captured in a variety of clinical trials data management systems, as well as two electronic health records: one in operation from 1976 to 2004 and one in operation since 2004. These data, on over 500,000 human subjects from over 50 source systems, are collected into a single database that forms the core of the NIH’s Biomedical Translational Research Information System (BTRIS).[5] All terms from source systems are assigned unique concept identifiers in BTRIS’s Research Entities Dictionary (RED), which is a unified ontology that includes hierarchical classifications of similar terms (such as tests that measure the same substance or medications that contain the same ingredients). BTRIS users select specific terms or classes of terms from the RED to use in retrieving identified data on their own clinical studies, as well as de-identified data across all clinical studies.[5]

Case Study: ABO and Rh Blood Type Antigens

Human red blood cells express a wide variety of antigens. Three in particular (A, B and Rh) are commonly identified in clinical laboratories for purposes such as cross-matching blood for transfusion. Blood donors and blood recipients are characterized as having A, B, AB or O blood and as being Rh positive or negative. Except in rare instances, an individual’s blood type does not change over his or her lifetime.


Today, most blood banks report these antigens together as the result of a single test; e.g., A+, B−, AB+, O−, etc. However, in the past, these results were reported across two tests: the ABO test to report types A, B, AB and O, and the Rh test with the possible results “positive” and “negative”. Even earlier, the ABO test results were reported as two separate tests for the A and B antigens. A patient could have a positive result for both of these tests (type AB), one of these tests (type A or B), or neither of these tests (implying type O). This heterogeneity has been well-documented as a challenge for pooling data from, or sharing them between, multiple health care sites.([6], also: Huff SM, personal communication; inspiring [7]) As a repository of data from multiple sites over 40 years, BTRIS often presents users with this type of challenge. We chose it as a case study for the normalization of heterogeneous data, since proper normalization should result in patients having consistent blood types over time, in effect serving as their own gold standards.
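The relationship between the oldest reporting convention (separate A and B antigen tests) and the modern combined result can be sketched in a few lines of Python. This is an illustration only, not code from BTRIS; the function names `abo_from_antigen_tests` and `combined_type` are hypothetical, and an ASCII hyphen stands in for the minus sign.

```python
def abo_from_antigen_tests(a_positive: bool, b_positive: bool) -> str:
    """Derive the ABO group from separate A-antigen and B-antigen tests."""
    if a_positive and b_positive:
        return "AB"
    if a_positive:
        return "A"
    if b_positive:
        return "B"
    return "O"  # neither antigen detected implies type O


def combined_type(abo: str, rh_positive: bool) -> str:
    """Join an ABO group and an Rh result into the single-test form."""
    return abo + ("+" if rh_positive else "-")


# Both antigen tests positive, Rh positive: the combined form is "AB+".
print(combined_type(abo_from_antigen_tests(True, True), True))
# Neither antigen test positive, Rh negative: the combined form is "O-".
print(combined_type(abo_from_antigen_tests(False, False), False))
```

Note that the inference of type O rests on two negative results, so a missing A or B test leaves the ABO group undetermined rather than O.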


Methods

Approach to Phenotype Normalization

Our approach to phenotypic characterization involves reducing complex findings to their most atomic forms and then assembling them in canonical ways into more complex phenotypic patterns. In the case of blood typing, we consider that each test provides evidence for the presence or absence of at least one red blood cell antigen. This can range from a single antigen (A, B, or Rh) to all of them (for example, type “AB+” indicates the presence of all three and type “O−” indicates the absence of all three).

Preliminary Analysis of Result Types

The first step in our normalization process was to identify the relevant test panels, individual tests and actual results reported by those tests. We used the BTRIS Limited Data Set function[8] to retrieve de-identified data from BTRIS, using appropriate terms selected from the RED.


The second step was to review each unique panel-test-result triple to determine the antigenic evidence it provides. Each result was tagged with the letters A, B and R, with presence indicated by an upper-case letter and absence indicated by a lower-case letter. Thus, a B Antigen test with the result “Positive” was labeled as “B” and a Type and Crossmatch test with the result “O+” was labeled as “abR”. A MUMPS data structure (PC-MUMPS, DataTree Inc., Waltham, MA) was created for each result and its assigned antigens.

Analysis of Patient Data for Phenotype Consistency

In the third step, the relevant data of individual patients were characterized based on the assigned antigens. Pooled evidence for the presence or absence of each antigen was stored for each patient in a second MUMPS global, such that a patient would typically have three letters (one each of “A” or “a”, “B” or “b”, and “R” or “r”). In the fourth step, we reviewed the results for each patient to identify cases where an individual did not have exactly one of each letter.
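Steps two through four can be sketched as follows. This is a minimal Python illustration, not the MUMPS implementation used in the study. The panel names, the rows of the `EVIDENCE` table, and the sample patient data are all hypothetical (in the study, each triple was tagged by manual review), and `classify` is one simplified reading of "exactly one of each letter". An ASCII hyphen stands in for the minus sign.

```python
from collections import defaultdict

# Hypothetical (panel, test, result) -> antigen tag table; illustrative rows only.
EVIDENCE = {
    ("ABO/Rh Panel", "B Antigen", "Positive"): "B",
    ("ABO/Rh Panel", "A Antigen", "Negative"): "a",
    ("ABO/Rh Panel", "Rh Type", "Positive"): "R",
    ("Type and Crossmatch", "Type and Crossmatch", "O+"): "abR",
    ("Type and Crossmatch", "Type and Crossmatch", "A-"): "Abr",
}


def pool_evidence(rows):
    """Collect the set of antigen letters seen for each patient (step three)."""
    pooled = defaultdict(set)
    for patient_id, panel, test, result in rows:
        tag = EVIDENCE.get((panel, test, result))
        if tag:  # skip results that carry no ABO/Rh evidence
            pooled[patient_id].update(tag)  # adds each letter of the tag
    return pooled


def classify(letters):
    """Label pooled evidence as complete, incomplete, or discrepant (step four)."""
    # For each antigen, count how many of its two letters (present/absent) occur.
    counts = [sum(c in letters for c in pair) for pair in ("Aa", "Bb", "Rr")]
    if any(n == 2 for n in counts):
        return "discrepant"  # evidence both for and against the same antigen
    if all(n == 1 for n in counts):
        return "complete"  # exactly one letter per antigen
    return "incomplete"  # at least one antigen has no evidence


# Hypothetical sample data: (patient_id, panel, test, result).
rows = [
    (1, "Type and Crossmatch", "Type and Crossmatch", "O+"),
    (1, "ABO/Rh Panel", "Rh Type", "Positive"),
    (2, "ABO/Rh Panel", "B Antigen", "Positive"),
    (3, "Type and Crossmatch", "Type and Crossmatch", "O+"),
    (3, "Type and Crossmatch", "Type and Crossmatch", "A-"),
]

for patient_id, letters in sorted(pool_evidence(rows).items()):
    print(patient_id, "".join(sorted(letters)), classify(letters))
```

In this sample, patient 1 has consistent evidence for all three antigens, patient 2 has evidence for the B antigen only, and patient 3 has conflicting evidence for both the A and Rh antigens. Storing letters in a set means repeated concordant results collapse to a single letter, so only conflicting results produce more than one letter per antigen.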


All data were obtained with oversight of the NIH Office of Human Subjects Research Protection (Agreement Number BTRIS_2014_835_CIMINO_J_CC). Only those data that did not require permission for reuse from the original investigators, as per NIH policy, were retrieved. The patients’ birthdates and test dates were included in the data but all other potentially identifying information was removed, as per NIH policy for limited-use data sets.

Results

Preliminary Analysis of Result Types

A search of the RED for tests with names containing the phrase “blood group” identified 644 terms, including three appropriate term classes as shown in Figure 1: ABO Grouping and/or Rh Antigen Phenotyping Intravascular Test, Blood Group Antigen Blood Typing Blood Bank Test, and RH Blood Group System Antigen (Rh Factor) Test. When BTRIS was queried with these three terms, 593,637 test results were found on 43,485 patients (see Figure 2). The data included results from 139 tests in 66 test panels, with 334 unique panel-test pairs that reported 3946 unique results (Figure 3).

Review of all panels and tests identified 21 panels and 59 tests within those panels that provide information on ABO and Rh blood typing. The data set included 1452 unique test results for these tests (many of which were misspellings, as in [6]). Each was reviewed manually to assign antigenic evidence. Table 1 shows a sample of panels, tests, results and assigned antigenic evidence.


Assignment of Phenotype Based on Antigenic Evidence


After removal of irrelevant panels and tests, the patient data included 165,981 panel events with 307,884 test results for 43,485 patients. Pooling of all antigenic evidence for each patient identified 32 different phenotypes, of which 8 were complete (ABO and Rh designations), 6 were incomplete (ABO or Rh designation only) and 18 were discrepant (multiple ABO and/or Rh designations). Table 2 shows the antigenic evidence and counts for each phenotypic designation.

Review of Aberrant Phenotypes


In all, 531 of the 43,485 patients (1.22%) had aberrant phenotypes, based on discrepant laboratory results. Some examples of their test results are shown in Table 3. Given that random, patient-independent laboratory errors could account for some of these discrepancies, and that the frequency of errors would be proportional to the number of tests run, we examined the frequency of aberrant phenotypes versus the number of test panels performed (Figure 4). The Pearson coefficient for this association is 0.7127 (P
