news and views

Understanding the links between privacy and public data sharing

David W Craig

The rapid emergence of cost-effective genomic sequencing is providing unprecedented insight into the genetic variation between individuals. Driven by initiatives in precision medicine, researchers are increasingly focused on understanding the relationships among genetic variations, human disease and clinical outcome. Frequently, germline genomic data are integrated with mRNA-seq and deep phenotypic data to provide further functional insight into inherited variation. However, because there are millions of genetic variants across

[Figure 1 schematic. Panel labels: Phenotype database; eQTL → SNP database; Genotype database. Step labels: De-identify phenotypes; Make genotype predictions. Example genes: BRAF, APOE, NRAS, PTEN. Example markers: SNP 1 to SNP 4.]

© 2016 Nature America, Inc. All rights reserved.

Linking clinical and phenotype variables across data sets will both power precision medicine studies and introduce new privacy risks


Figure 1 | Example of linkage attack in which de-identified gene expression data are provided with HIV status in phenotype database 2, and name and SNP data are provided in database 3. An eQTL-to-SNP database is used to link the databases, linking HIV status to subject name. Pat., patient.
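The linkage attack in Figure 1 can be sketched roughly in code. The example below is a minimal illustration, not the method of Harmanci and Gerstein: it uses invented toy values, a hypothetical one-to-one gene-to-SNP table, and a crude rank-based stand-in for extremity (only the highest and lowest expressor of each gene receives a predicted homozygous genotype). The patient identifiers `P1` to `P4`, the function names and all numbers are assumptions made for illustration.

```python
# Toy linkage attack. Every value below is invented for illustration only.

# Resource 1: de-identified phenotype database (gene expression plus HIV status).
expression = {
    "P1": {"BRAF": 6.9, "APOE": 6.8, "NRAS": 4.7, "PTEN": 7.0},
    "P2": {"BRAF": 9.1, "APOE": 7.1, "NRAS": 7.2, "PTEN": 4.9},
    "P3": {"BRAF": 7.2, "APOE": 5.2, "NRAS": 9.1, "PTEN": 8.8},
    "P4": {"BRAF": 4.8, "APOE": 9.3, "NRAS": 6.9, "PTEN": 7.1},
}
hiv_status = {"P1": False, "P2": False, "P3": True, "P4": False}

# Resource 2: eQTL table, gene -> (SNP, direction of effect).
# +1 means expression rises with the alternate-allele count, -1 means it falls.
eqtl = {
    "BRAF": ("SNP1", +1),
    "APOE": ("SNP2", +1),
    "NRAS": ("SNP3", -1),
    "PTEN": ("SNP4", +1),
}

# Resource 3: genotype database with names (alternate-allele counts 0, 1 or 2).
genotypes = {
    "Bob": {"SNP1": 2, "SNP2": 1, "SNP3": 1, "SNP4": 0},
    "Pat": {"SNP1": 0, "SNP2": 2, "SNP3": 1, "SNP4": 1},
    "Pam": {"SNP1": 1, "SNP2": 0, "SNP3": 0, "SNP4": 2},
    "Jane": {"SNP1": 1, "SNP2": 1, "SNP3": 2, "SNP4": 1},
}


def predict_genotypes(expression, eqtl):
    """Predict homozygous genotypes only for the extreme expressors of each gene."""
    predicted = {pid: {} for pid in expression}
    for gene, (snp, direction) in eqtl.items():
        ranked = sorted(expression, key=lambda pid: expression[pid][gene])
        lowest, highest = ranked[0], ranked[-1]
        predicted[highest][snp] = 2 if direction > 0 else 0
        predicted[lowest][snp] = 0 if direction > 0 else 2
    return predicted


def link(predicted, genotypes):
    """Match each de-identified sample to the named genotype that agrees most."""
    links = {}
    for pid, calls in predicted.items():
        def agreement(name):
            return sum(genotypes[name][snp] == g for snp, g in calls.items())
        links[pid] = max(genotypes, key=agreement)
    return links


predicted = predict_genotypes(expression, eqtl)
links = link(predicted, genotypes)
# Carry the sensitive phenotype across the link: name -> HIV status.
reidentified = {links[pid]: hiv_status[pid] for pid in expression}
print(links)
print(reidentified)
```

Even in this toy setting, a handful of predicted genotypes per sample is enough to pick out one name in the genotype database, which is the point the figure makes. A real attack would face noise, many more individuals and imperfect eQTL predictions.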

David W. Craig is at the Neurogenomics Division, Translational Genomics Research Institute, Phoenix, Arizona, USA. e-mail: [email protected]

human populations, millions of research participants will be needed and will have to be willing to share highly personal and private information about their health and wellness in order for connections to be fully elucidated. Reaching these numbers will require data sharing and integration from multiple studies and cohorts. It is likely that many individuals will participate across a variety of studies sharing different types of private data. Linking data across studies may enable biomedical breakthroughs, but it may also increase privacy risks. Protecting privacy is not just an ethical and legal challenge; it is a major research problem.

In this issue, Arif Harmanci and Mark Gerstein examine the increased privacy risk from linkage attacks when information about an individual is present in multiple high-dimensional genotypic and phenotypic data sets¹. Their study focuses on exploiting outliers or extremes in one group of data to re-identify and link to other databases. They analyze the tradeoff between privacy leakage and the emergence of large, multidimensional RNA-seq genomic data sets, such as those emerging from the GTEx, ENCODE and TCGA consortiums or through future efforts such as the Precision Medicine Initiative.

Researchers are obligated to share results while maintaining reasonable expectations of privacy. However, there are particular challenges in areas of precision medicine where publicly shared data sets are both complex and high-dimensional. In the initial genome-wide association studies, it was assumed that aggregating data into summary statistics such as allele frequencies de-identified individuals from clinical cohorts². However, work from our lab in 2008 showed that this assumption is surprisingly incorrect and that given a person’s genotype data, it is possible to determine cohort membership from single-nucleotide polymorphism (SNP) allele-frequency data³.
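The idea that allele frequencies alone can reveal cohort membership can be illustrated with a small simulation. The sketch below is a simplified version of a distance-based test in the spirit of ref. 3: for each SNP it compares how far a person's genotype lies from a reference frequency versus from the cohort frequency, and summarizes the differences in a t-like statistic. The data are simulated, SNPs are treated as independent, and the function names and parameter values are assumptions chosen for illustration.

```python
import math
import random
import statistics


def draw_genotype(p, rng):
    """Allele dosage scaled to 0, 0.5 or 1 for a SNP with allele frequency p."""
    return (int(rng.random() < p) + int(rng.random() < p)) / 2


def membership_t(person, cohort_freq, ref_freq):
    """t-like statistic over D_j = |Y_j - Ref_j| - |Y_j - Cohort_j|.

    Large positive values suggest the person's genotypes sit closer to the
    cohort frequencies than to the reference frequencies.
    """
    d = [abs(y - r) - abs(y - m) for y, m, r in zip(person, cohort_freq, ref_freq)]
    return statistics.fmean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


def simulate(n_snps=6000, cohort_size=50, seed=0):
    """Return the statistic for one cohort member and one outsider (toy data)."""
    rng = random.Random(seed)
    true_freq = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]

    def person():
        return [draw_genotype(p, rng) for p in true_freq]

    def allele_freqs(group):
        # Mean scaled dosage per SNP equals the group's allele frequency.
        return [statistics.fmean(column) for column in zip(*group)]

    cohort = [person() for _ in range(cohort_size)]
    reference = [person() for _ in range(cohort_size)]  # independent panel
    outsider = person()
    cohort_freq = allele_freqs(cohort)
    ref_freq = allele_freqs(reference)
    return (
        membership_t(cohort[0], cohort_freq, ref_freq),
        membership_t(outsider, cohort_freq, ref_freq),
    )


t_member, t_outsider = simulate()
print(f"member: {t_member:.2f}  outsider: {t_outsider:.2f}")
```

Under these assumptions the statistic for the cohort member should be clearly positive, while the outsider's should sit near zero. The per-SNP signal is tiny; it becomes detectable only because it accumulates over thousands of markers.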
In 2012, Im et al.⁴ extended these concepts to other summary-level measures such as expression quantitative trait loci (eQTLs), correlating genotype and expression data to determine study participation. Harmanci and Gerstein focus on another type of privacy breach, linkage attacks, in which correlated features across different data sets are exploited to link sensitive personal data (reviewed in ref. 5). These linkage attacks are relevant when phenotypic and genomic data sets are split to aid de-identification, or when a person has enrolled in multiple studies unbeknownst to researchers. Gymrek and colleagues provided one example by identifying several male participants of the 1000 Genomes Project using their Y-chromosome markers, their family structure and online genetic genealogy databases⁶. Harmanci and Gerstein build significantly on a 2012 publication by Schadt and colleagues⁷, whose work originally showed that eQTL analysis can be used to link genotype data to individual expression profiles and their clinical entries (such as those found in the US National Institutes of Health’s Gene Expression Omnibus database).

An important aspect of this new work is the use of extreme values (termed extremity) for linking individuals across data sets. The authors demonstrate an attack by examining a scenario in which an individual is in both an eQTL phenotype database and a genotype database of germline variation, using a third resource to predict genotypes from phenotypes (Fig. 1). This third resource can be thought of as a Rosetta stone for connecting expression and genotype and does not necessarily need to include the individual. The authors show that extreme values in the eQTL phenotype database could be used to better predict informative genotypes. The predicted genotypes were then used to identify the person in the seemingly unrelated genotype database. Considering that a few dozen genetic markers uniquely identify any person, reconnecting genotype data to sensitive phenotypes, such as HIV status, could represent a significant privacy breach. Going well beyond anecdotal examples, their publication provides a detailed and thorough theoretical examination of how privacy leakage can be quantified. This analysis will undoubtedly provide an important framework for future privacy risk management in genomics.

This article is particularly timely amidst other advances in privacy protection. For example, what is possibly the first practical clinical application of homomorphic encryption was recently reported in Genetics in Medicine by McLaren and colleagues⁸. Homomorphic encryption allows calculations to be performed on encrypted data without exposing the underlying data through decryption. The first plausible approach for fully homomorphic encryption was described only recently in Gentry’s 2009 PhD thesis⁹. Until very recently, homomorphic encryption was thought to be too computationally intensive for practical application. However, in the recent work by McLaren et al.⁸, the authors demonstrated the computation of various genetic tests, such as abacavir hypersensitivity based on HLA-B*57 alleles, on encrypted data for 230 HIV-positive individuals. In the face of genomic linkage attacks, one can anticipate considerable discussion about the role of homomorphic encryption of shared genomic data sets linked to sensitive clinical data.

These recent articles are particularly timely given the ongoing public debate on the regulations governing data sharing and consent. Research participants have a right to a reasonable expectation of privacy, and federal laws such as the Common Rule were developed to strike a balance between maintaining privacy and advancing medical research. Clearly, medical research and genomics have evolved faster than their governing regulatory framework. The US Department of Health and Human Services (HHS) recently proposed substantial changes in Common Rule execution, requiring consent for all biospecimens and allowing broad consent for future research and data. The proposed HHS changes acknowledge that DNA is indeed identifiable and that individuals should give informed consent if their tissue is to be used in research and later linked to sensitive information that is available publicly¹⁰.

As efforts such as the Precision Medicine Initiative will increase the amount and extent of genomics data linked to phenotypic data, linkage is becoming a core concept, and a perfect storm of discovery awaits. Linkage of publicly shared clinical and genetic data will undoubtedly lead to unexpected biomedical breakthroughs and form a foundation for precision medicine. Repeatedly, we observe that concepts leading to discovery are intertwined with privacy risks; Harmanci and Gerstein’s work reinforces the fact that linkage of clinical and genotypic data is no exception, and that this risk will be compounded as the number of high-dimensional databases grows. As the debate over changes to the Common Rule unfolds, it is increasingly clear that future biomedical research will depend on broadly consenting patients who are educated on how public data sharing both leads to biomedical breakthroughs and affects their own privacy risks through concepts such as linkage.

COMPETING FINANCIAL INTERESTS
The author declares no competing financial interests.

1. Harmanci, A. & Gerstein, M. Nat. Methods 13, 251–256 (2016).
2. Church, G. et al. PLoS Genet. 5, e1000665 (2009).
3. Homer, N. et al. PLoS Genet. 4, e1000167 (2008).
4. Im, H.K., Gamazon, E.R., Nicolae, D.L. & Cox, N.J. Am. J. Hum. Genet. 90, 591–598 (2012).
5. Erlich, Y. & Narayanan, A. Nat. Rev. Genet. 15, 409–421 (2014).
6. Gymrek, M., McGuire, A.L., Golan, D., Halperin, E. & Erlich, Y. Science 339, 321–324 (2013).
7. Schadt, E.E., Woo, S. & Hao, K. Nat. Genet. 44, 603–608 (2012).
8. McLaren, P.J. et al. Genet. Med. doi:10.1038/gim.2015.167 (14 January 2016).
9. Gentry, C. in Proc. 2009 ACM International Symposium on Theory of Computing 169–178 (ACM, 2009).
10. Hudson, K.L. et al. N. Engl. J. Med. 373, 2293–2296 (2015).

nature methods | VOL.13 NO.3 | MARCH 2016 | 211–212
