Protecting privacy in a clinical data warehouse.

Health Informatics Journal http://jhi.sagepub.com/

Protecting privacy in a clinical data warehouse Guilan Kong and Zhichun Xiao Health Informatics Journal published online 9 October 2014 DOI: 10.1177/1460458213504204 The online version of this article can be found at: http://jhi.sagepub.com/content/early/2014/10/07/1460458213504204

Published by: http://www.sagepublications.com


Article


Health Informatics Journal 0(0) 1–14 © The Author(s) 2014 Reprints and permissions: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/1460458213504204 jhi.sagepub.com

Guilan Kong and Zhichun Xiao Peking University, China

Abstract

Peking University has several prestigious teaching hospitals in China. To make secondary use of massive medical data for research purposes, construction of a clinical data warehouse is imperative at Peking University. However, a big concern in clinical data warehouse construction is how to protect patient privacy. In this project, we propose to use a combination of symmetric block ciphers, asymmetric ciphers, and cryptographic hashing algorithms to protect patient privacy information. The novelty of our privacy protection approach lies in message-level data encryption, the key caching system, and the cryptographic key management system. The proposed privacy protection approach scales to clinical data warehouse construction with any size of medical data. With the composite privacy protection approach, the clinical data warehouse can be secure enough to keep the confidential data from leaking to the outside world.

Keywords Clinical data warehouse, data encryption, hospital information system, privacy protection

Corresponding author: Zhichun Xiao, Medical Informatics Center, Peking University, 38 Xueyuan Rd, Haidian District, Beijing, China, 100191. Email: [email protected]

Introduction

In the last two decades, we have witnessed an ever-increasing volume of data being collected in hospital information systems (HISs), especially in electronic medical records (EMRs). As clinical databases provide a rich source of data for research in areas such as medical care delivery and medical quality monitoring, how to make secondary use of this large volume of collected data has become a hot research topic in the literature.1–5 Abhyankar et al.6 proposed a method to standardize clinical laboratory data for secondary use. Tolar and Balka7 put forward advice on how to enhance care through secondary use of EMR data in a general practice setting. The Strategic Health IT Advanced Research Projects Area 4 Consortium (SHARPn) project, funded by the US government in 2010 to build a robust infrastructure for secondary use of electronic health record (EHR) data, has taken shape.8

Driven by demand, data warehousing has been proposed as a way of supporting secondary use of the invaluable medical data maintained in current HISs.9,10 With the aid of data warehousing technologies, we can collect and integrate data from various HISs and build multidimensional patient information cubes or data marts. Based on the raw, aggregate, or statistical information displayed in a clinical data warehouse (CDW), researchers can test various hypotheses, which can stimulate further research on healthcare-related issues. Several CDWs11,12 have been developed for this purpose.

However, with the increasing complexity of integrating patient data from different departments or hospitals, and the keen desire of other entities such as insurance companies and pharmaceutical companies to access these data, serious pressure is put on a CDW’s capability to protect patient privacy. Privacy violations may result in loss of dignity, and confidential information from a person’s medical record may influence his or her credit, employment, and ability to get health insurance.13 Therefore, protecting patient privacy becomes a big challenge for CDW designers.

In response to these concerns, various methods have been employed. On the policy and law side, several bills were introduced during the past two decades in some developed countries. Take the United States as an example: the Health Insurance Portability and Accountability Act (HIPAA) was signed into law in 1996, and component privacy regulations were published in December 2000.14 However, because China is a developing country, the most worrying aspect of healthcare services for its citizens is medical cost, and only a small portion of the population may be conscious of their privacy when seeing a doctor. Research on patient privacy protection is still in its initial stage, and aside from some graduate theses15,16 that discuss legislation to protect patient privacy, there is neither a specific bill nor a regulation regarding patient privacy protection. On the technology side, first, user access control17,18 has been proposed as a measure to limit users’ access to confidential and sensitive data. Second, many methods for data encryption and pseudonymization have been proposed by researchers19–23 to protect privacy.
In this article, we propose the use of composite methods to implement privacy protection in the CDW being developed at Peking University (PKU). In our composite approach, the privacy model is composed of various data encryption methods and a user access control mechanism in a layered CDW structure, and we mainly describe the composite data encryption approach in this article. The structure of this article is as follows: a brief introduction to the Peking University Clinical Data Warehouse (PKUCDW) and the necessity of protecting patient privacy in the CDW are given in the “Background and significance” section; the “Methods” section presents a composite privacy protection model designed for the PKUCDW; and finally, the “Conclusion” section summarizes the results and the contribution of this article.

Background and significance

Background of PKUCDW

PKU is affiliated with six prestigious teaching hospitals. Three of them are general hospitals, and the other three are specialist hospitals for cancer, dentistry, and mental disorders. All affiliated general hospitals have adopted HISs such as EMRs, laboratory information systems (LISs), pharmacy information systems (PISs), picture archiving and communication systems (PACS), and so on. Motivated by the secondary use of the large volume of data collected in HISs, a research group at Peking University Medical Informatics Center (PKUMIC), with experts from computer science, decision making, data mining, statistics, and epidemiology, was formed in 2011. A very important mission for PKUMIC is to centralize and integrate data from the HISs located in all affiliated hospitals and to develop a CDW to extract new knowledge and valuable information from a large volume of medical data.



[Figure 1: layered diagram. From top to bottom: end user roles (epidemiologist, clinical researcher, hospital manager, data miner, ...); online analytical processing (OLAP) server; data marts and cubes (patient, medical quality, heart disease, ...); data warehouse; extract, transform and load (ETL); source data from the HIS upload servers of affiliated hospitals 1 to 6.]

Figure 1. Structure of PKUCDW. PKUCDW: Peking University Clinical Data Warehouse.

Structure of PKUCDW

We adopted the architecture proposed by Inmon24 to design our CDW. In this structure, a data warehouse (DW) is a repository that provides data for data marts, which are created only after a complete DW has been created. Inmon’s DW structure matches our requirements for a CDW very well. The main mission of PKUCDW is to warehouse the data in the main HISs across all affiliated hospitals and to analyze the data from different research angles. The end users of PKUCDW include not only researchers at PKUMIC but also medical staff in affiliated hospitals. As different users require data for analysis from their own perspectives, it is better to provide them with different data marts or cubes according to their specific requirements. Here, a cube implies something very specific, while a data mart is more inclusive and can contain tables or cubes. The structure of PKUCDW is shown in Figure 1.
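The distinction between a data mart and a cube can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the row layout, the field names (`hospital`, `year`, `diagnosis`), the sample rows, and the helper names `build_data_mart` and `build_cube` are hypothetical and are not taken from the PKUCDW implementation.

```python
from collections import Counter

# Made-up warehouse rows; a real CDW would hold integrated HIS records.
warehouse = [
    {"hospital": "hospital1", "year": 2012, "diagnosis": "heart disease"},
    {"hospital": "hospital1", "year": 2013, "diagnosis": "heart disease"},
    {"hospital": "hospital2", "year": 2013, "diagnosis": "heart disease"},
    {"hospital": "hospital2", "year": 2013, "diagnosis": "diabetes"},
]


def build_data_mart(rows, diagnosis):
    """A data mart here is the subject-specific subset of warehouse rows."""
    return [row for row in rows if row["diagnosis"] == diagnosis]


def build_cube(rows, dimensions):
    """A cube here is a count aggregated over a fixed set of dimensions."""
    return Counter(tuple(row[d] for d in dimensions) for row in rows)


# The mart is derived from the complete warehouse, the cube from the mart.
heart_mart = build_data_mart(warehouse, "heart disease")
heart_cube = build_cube(heart_mart, ("hospital", "year"))
```

In this sketch the mart still contains row-level tables, while the cube only answers questions along its chosen dimensions, for example `heart_cube[("hospital2", 2013)]`.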

Privacy concerns

The biggest public concern about making secondary use of data in HISs is the violation of patient privacy. Besides routine clinical data such as common clinical symptoms and laboratory results, HISs also store confidential data such as name, address, telephone number, medical record number, identification card number (IDN), and health insurance card number, as well as sensitive data such as HIV/AIDS status. If these confidential and sensitive patient data are disclosed to the public or to malicious users, the patients may suffer negative effects. Besides patients’ confidential and sensitive information, some private data of doctors are stored in HISs as well. For example, details such as the name, address, telephone number, and employee card number of clinicians are stored in the HISs together with patient data. Thus, we are concerned about not only patients’ but also clinicians’ privacy in the design and development of PKUCDW.

In our case, although all researchers and staff involved in the CDW project have signed agreements with PKUMIC or affiliated hospitals on privacy protection, there are still concerns about disclosure by internal personnel, as all parties with strong interests in the EMR data may make every possible attempt to access the data. Take pharmaceutical companies, for instance: because clinicians in different hospitals may have different preferences in prescribing medicines for patients, sales representatives from pharmaceutical companies are keen to obtain EMR data to analyze which clinician in which hospital favors which medicines. With the above concerns in mind, it is necessary for us to employ technological methods together with administrative procedures to protect data privacy in the design and development of PKUCDW.

Overview of privacy protection methods

In general, privacy protection can be enforced from two perspectives. From one perspective, protecting privacy means limiting the number of users that can access the data; from the other, it means limiting the data that can be accessed. For limiting users’ access to data, different user access control mechanisms have been proposed and employed in the literature.17,18,25–27 For limiting the data that can be accessed by users, different data de-identification approaches have been developed.21,22,28,29 Administrative procedures and technological methods have been proposed and employed in a complementary way in the literature.7,21,22,25,30 However, most methods employed in the literature are for medical data sharing or publishing, and they lack viable and practical approaches to comprehensively protect patient privacy from the beginning of HIS data transmission to the end of researchers’ data manipulations.

Data encryption. The general approaches used to protect or de-identify person-specific confidential and sensitive information include data encryption and data pseudonymization.7,19–21 As our study focuses on encrypting data to protect privacy, we only briefly discuss data encryption methods in the following. Data encryption can protect sensitive data from unauthorized access. Two widely used data encryption methods in the literature are symmetric block ciphers and public key cryptography. A symmetric block cipher uses a permutation–substitution network to encrypt a fixed-size data block with a predefined encryption key.
The security of a block cipher depends on the length of the encryption key, and the time complexity of a brute force attack against a cipher with a 256-bit key is $O(2^{256})$.31 Unlike a symmetric block cipher, which uses one secret key shared between two partners, a public key cryptography system uses two keys: a private key, which is kept secret by its owner, and a public key, which can be published to the outside world. The security of public key cryptography relies upon the asymmetric behavior of some well-known hard mathematical problems, for example, the integer factorization and discrete logarithm problems. Public key cryptography provides security functionalities such as data encryption, key exchange, and digital signatures.
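The two ideas above can be made concrete with a small Python sketch. The first function estimates how long an exhaustive search of a key space would take; the guessing rate passed to it is an arbitrary assumption. The second builds a textbook RSA key pair from two tiny primes to show that what is encrypted with the public key is recovered with the private key. The function names and the numbers are illustrative only; real systems use vetted libraries and keys of thousands of bits, and nothing here reflects the specific ciphers chosen for PKUCDW.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def brute_force_years(key_bits: int, guesses_per_second: float) -> float:
    """Worst-case years needed to try every key of the given length."""
    return (2 ** key_bits) / guesses_per_second / SECONDS_PER_YEAR


def toy_rsa_keypair(p: int, q: int, e: int):
    """Textbook RSA key pair from two small primes (illustration only)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse of e; requires Python 3.8+
    return (n, e), (n, d)  # (public key, private key)


# Assumed attacker speed: one trillion key guesses per second.
years_256 = brute_force_years(256, 1e12)

# Anyone may encrypt with the public key; only the private key decrypts.
public_key, private_key = toy_rsa_keypair(61, 53, 17)
message = 65
cipher = pow(message, public_key[1], public_key[0])
recovered = pow(cipher, private_key[1], private_key[0])
```

Even at the assumed rate, `years_256` is a number with more than fifty digits, which is why key length rather than secrecy of the algorithm carries the security of a block cipher.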



Methods

Message-level encryption

Since patient data such as name, IDN, and other information are very critical and sensitive, we need to provide end-to-end protection for them. The name and IDN must be encrypted at rest and in-transit, starting from the upload server of the hospital HIS system to the backend system in the CDW. Encrypting data at rest can effectively prevent sensitive data from being accessed by unauthorized users. Even if a malicious third party gains access to the electronic media, the data they obtain are cipher-texts, whose original meaning cannot be revealed without the proper cryptographic algorithms and encryption keys. Encrypting data in-transit can protect the data from eavesdropping when they are transmitted along wired or wireless communication channels. Without the correct cryptographic algorithms and keys, a malicious third party cannot recover the original plain-text from the cipher-text in a reasonable amount of time.

Encrypting data at rest can be implemented by a variety of techniques at different layers, from lower-level hard drive encryption to higher-level application layer data encryption. Although hard drive encryption is an easy solution, we chose not to apply it in the CDW for two reasons. The first is that if all data on the hard disk are encrypted, they have to be decrypted whenever they are read by applications, which results in a significant reduction in performance. The second is that hard-disk encryption can only protect the data from being breached in some special cases, such as when the system is powered off and the hard drives are stolen. When the system is up and running, the data on the hard drive are decrypted automatically and become clear-text to end users, including malicious users who break into the system, which defeats the whole purpose of data encryption.

Another solution is to use database layer encryption technologies to enforce data encryption at rest.
The advantage of this solution is its ease of use and user transparency, which means existing applications can run smoothly without code changes. However, a big disadvantage is that it is limited to a particular database product, which makes interoperability between different products a big problem. Another issue is the limited cryptographic key management capability of database column-level encryption solutions, because existing solutions ordinarily use only one key for each table or for each column. If one key is compromised, all the data protected by that key are leaked. The third problem is that database encryption can be applied only to structured data. However, there are large volumes of unstructured data, such as XML files, in the HIS systems. To enforce data encryption for these unstructured data, we need other technologies. Therefore, instead of using hard-disk encryption and database-level encryption, we chose to encrypt data at the message layer, which can effectively solve these issues.

Encrypting data in-transit can also be enforced at different layers. The widely adopted network layer data security protocol IPsec and the transport layer protocol SSL/TLS (secure socket layer/transport layer security) provide a transparent secure communication channel to end users. However, both of these techniques encrypt all the data transmitted through the channel, which consumes a lot of computing resources and requires considerable effort to set up the environment.

When designing the security architecture of the CDW, we adopted an application-level data encryption solution to protect data at rest and in-transit at the same time. All the data in hospital HIS systems are first classified into two classes, either private or public. Private data are those data that contain sensitive information such as patient name, IDN, and so on, which need to be encrypted at rest and in-transit.
Public data are those data that do not contain sensitive, personal information and can be accessed freely by a third party; they can be stored on physical devices and transmitted in clear-text format. Most of the clinical data are classified as public data and thus can be stored and transmitted in clear-text, which saves a lot of computing resources. Only a small portion of the clinical data is classified as private data and must be encrypted at rest and in-transit.

[Figure 2: layered diagram. Encrypted source data flow from the HIS upload servers of affiliated hospitals 1 to 6 through extract, transform and load (ETL) into the data warehouse, then to data marts (patient, heart disease, medical quality, ...) and the online analytical processing (OLAP) server, and reach end users (epidemiologist, clinical researcher, hospital manager, data miner, ...) as decrypted source data. An HIS key server and a CDW key server support the end-to-end data encryption.]

Figure 2. Security architecture of PKUCDW. PKUCDW: Peking University Clinical Data Warehouse.

Figure 2 shows the security architecture of the PKUCDW. The private data at the data source are encrypted at one end and can be decrypted at the other end, based on the end user’s security role and policy. Specifically, an end user such as a CDW developer with lower-level access rights, who has no right to access decrypted data, has to work with the encrypted data. The same rule applies to end users of CDW data.
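The message-level scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PKUCDW implementation: the field names, the role names, and all function names are hypothetical, and the cipher is a standard-library stand-in (an HMAC-SHA256 keystream in counter mode, without integrity protection) used only so that the sketch runs without third-party packages. A production system would use a vetted block cipher implementation and obtain keys from a key server.

```python
import hashlib
import hmac
import os

# Hypothetical classification: fields listed here are private, all others public.
PRIVATE_FIELDS = {"name", "idn"}
# Hypothetical roles that are allowed to see decrypted private data.
ROLES_WITH_DECRYPT_RIGHT = {"clinical_researcher"}


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Stand-in cipher core: HMAC-SHA256 in counter mode (illustration only)."""
    out = b""
    counter = 0
    while len(out) < length:
        block_input = nonce + counter.to_bytes(8, "big")
        out += hmac.new(key, block_input, hashlib.sha256).digest()
        counter += 1
    return out[:length]


def encrypt_value(key: bytes, plaintext: str) -> str:
    """Encrypt one field value; a fresh random nonce is prepended to the output."""
    data = plaintext.encode("utf-8")
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(data))
    return (nonce + bytes(a ^ b for a, b in zip(data, stream))).hex()


def decrypt_value(key: bytes, token: str) -> str:
    """Reverse encrypt_value using the nonce stored in front of the cipher-text."""
    raw = bytes.fromhex(token)
    nonce, cipher = raw[:16], raw[16:]
    stream = _keystream(key, nonce, len(cipher))
    return bytes(a ^ b for a, b in zip(cipher, stream)).decode("utf-8")


def encrypt_message(key: bytes, record: dict) -> dict:
    """Encrypt only the private fields; public fields stay in clear-text."""
    return {k: encrypt_value(key, v) if k in PRIVATE_FIELDS else v
            for k, v in record.items()}


def view_message(key: bytes, message: dict, role: str) -> dict:
    """Return decrypted private fields only to roles with the decrypt right."""
    if role not in ROLES_WITH_DECRYPT_RIGHT:
        return dict(message)
    return {k: decrypt_value(key, v) if k in PRIVATE_FIELDS else v
            for k, v in message.items()}


# Made-up record as it might leave a hospital upload server.
key = os.urandom(32)
record = {"name": "Example Patient", "idn": "ID-0001", "diagnosis": "hypertension"}
stored = encrypt_message(key, record)
```

Because the same encrypted message is what gets stored and what gets transmitted, one step covers data at rest and in-transit, and only the small private portion of each record costs any cryptographic work. Calling `view_message(key, stored, "cdw_developer")` returns the cipher-text unchanged, while a role with the decrypt right gets the original record back.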

Data encryption process

The following pseudo code (Algorithm 1) is the encryption algorithm used by the hospital upload servers. The input is private data and the output is the cipher-text. Encryption key generation and protection happen automatically behind the scenes and are transparent to end users.

Algorithm 1: Encrypt Private Data Into Cipher-Text
Input: clear-text of private information
Output: encrypted private information
out EncryptData(in) { seedtime
