Genetic Epidemiology

EDITORIAL

The Critical Need for Computational Methods and Software for Simulating Complex Genetic and Genomic Data

Jason H. Moore

Department of Genetics, Institute for Quantitative Biomedical Sciences, Geisel School of Medicine, Dartmouth College, Hanover, NH 03756, United States of America

Received 7 November 2014; accepted revised manuscript 11 November 2014. Published online 4 December 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI 10.1002/gepi.21878

The development of novel computational, mathematical, and statistical methods for the genetic analysis of complex traits such as disease susceptibility and drug response is more important than ever given the size and complexity of genetic, genomic, and clinical data. Critical to the development of any new method is the plan for evaluation. Does the method produce good models? Is the method computationally efficient? Are the results interpretable? Is the method accessible to others? Does the method perform better in some circumstances than other widely used methods? This last question is ideally addressed early in the development process using simulated data where the answer is known. Ideally, the developer would simulate a wide range of different types of data with varying levels of noise and different types of signals. The goal of the simulation should be to produce a sufficient quantity and diversity of data to determine the strengths and weaknesses of the new method and whether it makes a complementary contribution to the existing methodological toolbox. The quality of this inference will of course depend on the assumptions made during the data simulation and how accurately the data mimic what occurs in nature. As we learn more about the complexity of the human genome, and how nucleotide variation impacts traits through a hierarchy of molecular and physiological systems, we need to concurrently adapt our simulation methods to embrace this complexity. Only then will we be sure our new analytical tools are ready for working with real data.

The papers in this special issue are the result of a National Cancer Institute (NCI) sponsored workshop entitled “Genetic Simulation Tools for Post-Genome Wide Association Studies of Complex Diseases” at the National Institutes of Health (NIH) in Bethesda, Maryland on March 11–12, 2014. A full meeting report by Chen et al. is included in the special issue and highlights several important challenges including the simulation of whole genome sequence data, providing standards and improved documentation for simulation software, and encouraging the simulation community to work together. Before these challenges can be tackled there must be an accounting of existing simulation methods and software. To this end, Peng et al. in this special issue have created the genetic simulation resource (GSR) website that allows developers to register their software with information about its features. Nearly 100 different simulation software packages for a wide variety of different types of data have been included in this resource. This is an important first step toward making it easy for those in need of simulated data to quickly identify what software is available and to compare packages based on criteria such as the existence of documentation.

The remaining four papers in the special issue focus on a variety of new simulation methods. The paper by Chung et al. focuses on simulating correlated quantitative traits in pedigrees. Their SeqSIMLA approach specifically takes into account shared environmental effects. The paper by Peng introduces variant simulation tools (VST) for simulating genetic variants in next-generation sequence data using forward-time simulation. The VST approach specifically considers the functional effects of both synonymous and nonsynonymous variants in different gene regions. The paper by Uricchio et al. presents a forward-time simulation approach for generating DNA sequence data that takes into account evolutionary forces such as demographic events and natural selection. They show how this approach is particularly useful for simulating rare variants. The paper by Moore et al. introduces a biology-based method for simulating genotype–phenotype relationships that include hierarchical gene–gene interactions or epistasis. The goal of this approach is to mimic the hierarchical complexity of common human diseases. Each of these papers has the same focus of improving the biological realism in the simulated data. These studies and others represent the first step toward generating complex data that can test our analysis methods and ready them for the realities of human genetics and genomics.

Correspondence to: Jason H. Moore, Ph.D., HB7937, One Medical Center Dr., Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA. E-mail: Jason. [email protected]; Phone: 603-653-9939
Acknowledgments

The work was supported by NIH grants EY022300, LM009012, LM010098, LM011360, GM097765, and AI59694. We would like to thank the participants of an NIH workshop on the simulation of genetic data for their stimulating feedback and discussion that helped formulate some of the ideas in this paper.

© 2014 Wiley Periodicals, Inc.
