pii: sp- 00021-15

http://dx.doi.org/10.5665/sleep.5774

REVIEW

Scaling Up Scientific Discovery in Sleep Medicine: The National Sleep Research Resource

Dennis A. Dean, II, PhD1,2; Ary L. Goldberger, MD2,3,4; Remo Mueller, PhD1,2; Matthew Kim, MD1,2; Michael Rueschman, MPH1; Daniel Mobley, RPSGT1; Satya S. Sahoo, PhD5; Catherine P. Jayapandian, PhD5; Licong Cui, PhD5; Michael G. Morrical, RPSGT1; Susan Surovec, BA1; Guo-Qiang Zhang, PhD6; Susan Redline, MPH1,2,3,7

1Brigham and Women’s Hospital, Boston, MA; 2Harvard Medical School, Boston, MA; 3Beth Israel Deaconess Medical Center, Boston, MA; 4Wyss Institute at Harvard University; 5Division of Medical Informatics and Electrical Engineering Computer Science Department, Case Western Reserve University, Cleveland, OH; 6Institute of Biomedical Informatics, University of Kentucky, Lexington, KY; 7Harvard School of Public Health, Boston, MA

Professional sleep societies have identified a need for strategic research in multiple areas that may benefit from access to and aggregation of large, multidimensional datasets. Technological advances provide opportunities to extract and analyze physiological signals and other biomedical information from datasets of unprecedented size, heterogeneity, and complexity. The National Institutes of Health has implemented a Big Data to Knowledge (BD2K) initiative that aims to develop and disseminate state-of-the-art big data access tools and analytical methods. The National Sleep Research Resource (NSRR) is a new National Heart, Lung, and Blood Institute resource designed to provide big data resources to the sleep research community. The NSRR is a web-based data portal that aggregates, harmonizes, and organizes sleep and clinical data from thousands of individuals studied as part of cohort studies or clinical trials, and it provides the user with a suite of tools to facilitate data exploration and data visualization. Each deidentified study record minimally includes the summary results of an overnight sleep study; annotation files with scored events; the raw physiological signals from the sleep record; and available clinical and physiological data. NSRR is designed to be interoperable with other public data resources, such as the Biologic Specimen and Data Repository Information Coordinating Center Demographics (BioLINCC) data, and to be analyzed with methods provided by the Research Resource for Complex Physiological Signals (PhysioNet). This article reviews the key objectives, challenges, and operational solutions for addressing big data opportunities for sleep research in the context of the national sleep research agenda. It provides information to facilitate further interactions of the user community with NSRR, a community resource.
Keywords: electrocardiography, polysomnography, signal processing, spectral analysis, precision medicine, big data

Citation: Dean DA, Goldberger AL, Mueller R, Kim M, Rueschman M, Mobley D, Sahoo SS, Jayapandian CP, Cui L, Morrical MG, Surovec S, Zhang GQ, Redline S. Scaling up scientific discovery in sleep medicine: the National Sleep Research Resource. SLEEP 2016;39(5):1151–1164.

Significance

Analysis of large volumes of data provides opportunities to shift research from a focus on predicting group averages to predicting individual outcomes, as is needed to support the goals of precision medicine. Aggregating polysomnography-derived signal data also provides opportunities to discover and replicate new physiological signatures for disease or disease risk. The National Sleep Research Resource is a National Heart, Lung, and Blood Institute-funded resource established to enhance the use of research sleep data to promote discovery and generation of novel hypotheses, particularly by leveraging information from overnight physiological signals. This article reviews how this Resource is structured, including approaches for annotating and visualizing data, for generating and sharing the results of quantitative signal analysis, and for promoting scientific collaboration.

INTRODUCTION

The sleep research community is particularly well poised to benefit from technological advances that allow large amounts of data to be linked to the powerful informatics and computational resources advocated by “Big Data” initiatives.1–8 The 2011 National Institutes of Health (NIH) Sleep Disorders Research Plan (NSDRP)1 identified a number of priority areas for research that may particularly benefit from access to large datasets and tools for their analysis. These include investigations aimed at attaining a deeper understanding of age- and sex-related changes in sleep and circadian biology; identification of genetic and other risk factors that increase the risk of sleep and circadian disorders in specific individuals and subgroups; identification of basic mechanisms for sleep and circadian diseases; and tool development that leads to improved diagnosis and treatment of sleep and circadian disorders.1 The 2014 Sleep Research Society and American Academy of Sleep Medicine Joint Strategic Plan2 further highlighted the need to address the health and societal effect of sleep deficiency and circadian function, and the need to establish research networks and informatics infrastructure to broadly support the goals of the NSDRP. The development of large, well-annotated data resources in tandem with analytical tools can be anticipated to further many of these strategic goals by generating new scientific discoveries, creating an opportunity for developing new sleep analytics that are more closely associated with outcome data, supporting training opportunities, and providing a platform for sleep research networks to perform large-scale, systems-level analyses of data stored in sleep studies.

The following commentary describes the National Sleep Research Resource (NSRR) goals that motivate its structure and content in the context of the national sleep research strategic agenda. We highlight organizational and technical approaches for overcoming some of the challenges in harmonizing and aggregating large and disparate data that are required to scale up scientific discovery in sleep medicine. A glossary of technical terms not commonly used in sleep research is provided in Table 1. In addition, progress in the first 14 months of public availability of the resource is presented for each goal.

NSRR OVERVIEW

The NSRR (www.sleepdata.org) is a recent initiative designed to address the aforementioned strategic goals. To ensure relevance to the community, the NSRR was designed around specific community needs required to support NIH-mandated data sharing that enable data aggregation and large-scale

SLEEP, Vol. 39, No. 5, 2016


Scaling up Scientific Discovery—Dean et al.

Table 1—National Sleep Research Resource glossary of terms.

Informatics Terms

Canonical data dictionary: Maps study-specific data terms to a standardized set of data definitions.

Cloud solutions: A way to provide networked hardware and application solutions in a way that is transparent to the user.

Data dictionary: A collection of data terms associated with a study. Each data term includes a description and a list of valid entry values.

Git: A program used to track software versions. Tracking software version information is a recommended reproducible research practice for maintaining bug reports, bug fixes, and revisions.

GitHub: A web-based platform for modifying and sharing source code. In addition to Git functionality (see the Git entry above), GitHub includes functionality for allowing multiple people to contribute to a code base while maintaining version control for revisions.

JSON: JavaScript Object Notation is a data format first used to exchange data between web applications. JSON is supported by modern programming languages, making it a universal way to exchange information.

Mapping data to definitions: The process of assigning study variable names to terms in the canonical data dictionary.

Ontology: A formal definition of a domain (area of interest) described by terms. Terms can include types, properties, and interrelationships. Ontological representations are used in information science to create domain-aware software and informatics applications.

Provenance: Data provenance is the information required to document the origin of data and the context in which the data were collected. Data provenance provides a way for data users to verify that data are interpreted and used appropriately.

Sleep Domain Ontology: An ontology under NSRR development that aims to include key sleep medicine terms, including disorder types, physiological data, medications, covariates, and other data as required to archive NIH-funded sleep studies.

XML: Extensible Markup Language (XML) is a text-based language designed to share information across the internet. XML is intended to be both human and computer readable.

XML Schema: A brief description of the content of an XML document (file).

Analysis Terms

Fractal scaling: A pattern that repeats at different scales. Fractal scaling has been identified in nature, including in physiological signals.

Lyapunov exponents: Characterize the predictability/chaos of a system.

Multiscale entropy: A method for measuring the complexity of a finite time series.

Nonlinear/complexity properties: Nonlinear properties describe a system where the output of the system is not proportional to the inputs. Complexity properties describe the order/disorder of a system.

Physiologic coupling: A process where two physiological signals influence each other through a feedback process.

Spectral analysis: A mathematical transformation that converts a long time series into a plot of amplitude vs. frequency.

Time irreversibility: An attribute that describes whether the order of points from a stochastic or deterministic process is, respectively, associated with the timing of past events or whether the reversed points are described by the same equation.

System/Network Analysis: Network analysis aims to represent a system as a graph. In the context of physiology, network or system analysis aims to identify coupling between systems.
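The multiscale entropy entry in Table 1 can be made concrete with a short sketch. The Python below is a textbook-style illustration on synthetic data, not NSRR or PhysioNet code: it coarse-grains a series at several scales and computes sample entropy at each scale. The function names, the template length m = 2, and the tolerance of 0.2 standard deviations are illustrative assumptions rather than values taken from this article.

```python
import math
import random
import statistics


def coarse_grain(series, scale):
    """Average consecutive, non-overlapping windows of `scale` points."""
    n_windows = len(series) // scale
    return [
        sum(series[i * scale:(i + 1) * scale]) / scale
        for i in range(n_windows)
    ]


def sample_entropy(series, m, r):
    """Return -ln(A / B) for template lengths m (B) and m + 1 (A).

    Two templates match when every pair of points differs by at most r.
    Self-matches are excluded; the result is infinite if no matches exist.
    """
    n = len(series)

    def count_matches(length):
        # The same n - m starting points are used for both template lengths.
        templates = [series[i:i + length] for i in range(n - m)]
        matches = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                if all(abs(a - b) <= r for a, b in zip(templates[i], templates[j])):
                    matches += 1
        return matches

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return math.inf
    return -math.log(a / b)


def multiscale_entropy(series, max_scale, m=2, r_fraction=0.2):
    """Sample entropy of the coarse-grained series at scales 1..max_scale."""
    # The tolerance is fixed once, from the original (scale 1) series.
    r = r_fraction * statistics.pstdev(series)
    return [
        sample_entropy(coarse_grain(series, scale), m, r)
        for scale in range(1, max_scale + 1)
    ]


if __name__ == "__main__":
    random.seed(0)
    # Synthetic white noise, standing in for a physiological time series.
    noise = [random.gauss(0.0, 1.0) for _ in range(300)]
    print(multiscale_entropy(noise, max_scale=3))
```

A perfectly regular series (for example, a strictly alternating one) yields an entropy of zero at scale 1, whereas irregular series yield positive values. The pairwise comparison is quadratic in the series length, so a sketch like this suits short segments rather than full overnight recordings.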

analyses (see Table 2 for examples). The NSRR’s initial focus is to integrate data collected through NIH-funded cohort studies and clinical trials into a data resource that supports hypothesis generation, discovery, and replication studies, and to make this resource broadly accessible to the sleep research community. The NSRR will initially populate the resource with reliably scored, well-annotated research sleep studies from 15 major cohorts or clinical trials (~47,000 sleep studies from over 37,000 subjects)9–28 while creating a structure that allows deposition of data from other sources. NSRR offers a platform to foster a “new” collaborative culture leading to sleep data integration. NSRR is designed with the flexibility and scalability needed to accommodate diverse data types and datasets that originate from different sources. The platform can support the needs of individual projects and research teams as well as the migration of individual datasets into larger data pools capable of supporting new levels of scientific query, such as: (1) developing sleep normative data across the age spectrum and for groups defined by sex and race/ethnicity; (2) identifying novel biomarkers (extracted from time series of multiple physiological signals); and (3) elucidating the relationships between sleep deficiency and a wide variety of health outcomes over the life course. In particular, making aggregated data available across many sources creates the opportunity to answer questions that single studies alone cannot because of limited sample size, event frequency, and population diversity. Given the expense and burden of sleep research, a data repository of

Table 2—NSRR requirements extrapolated from research community needs.

Research community need: Meet NIH mandate for sharing data generated from large projects
NSRR requirement: Data curation and deposition through the NSRR portal

Research community need: Search across multiple data sources and retrieve summary information about the types of study, high-level counts, and statistical charts
NSRR requirement: Provide robust Query and Search Interface tools

Research community need: Secure and controlled data access conforming to data use specifications given at the individual project, cohort, institution, and investigator levels
NSRR requirement: Fine-grained role-based access control management interface and audit system, with data access policy governed by the NSRR Academic User Group; user-friendly online DAUA; local IRB approval process as needed

Research community need: Search and download deidentified datasets to a local computer for off-line analysis, including PSG files
NSRR requirement: Query and Search Interface with web-based secure downloading for users with appropriate access privileges

Research community need: Tools for processing and manipulating PSG files in EDF format
NSRR requirement: EDF Viewer, EDF Editor, and EDF Annotation Translator in Java to standardize and visualize sleep studies drawn from multiple sources

Research community need: Access to signal processing and computational methods to classify and extract features based on frequency changes and other parameters, as well as cardiac arrhythmias and heart rate variability
NSRR requirement: Provide measurements of key parameters from EEG, EOG, EMG, and ECG using open-source code, ready to be used by investigators in their biostatistical analysis

Research community need: Self-deployed tools for managing and integrating local research data sources within groups or institutional boundaries
NSRR requirement: Downloadable, ready-to-deploy data integration and data exploration virtual machine image of the same software framework operating the NSRR portal (everything other than the actual data content)

Research community need: Standard specification of terminologies, concepts, and data elements for sleep research
NSRR requirement: Sleep Domain Ontology and Sleep Provenance Ontology, open, sharable, and evolving with the latest practice

already collected sleep-related signals can provide a cost-effective resource to catalyze collaborative efforts of investigators and trainees across diverse areas of sleep medicine (epidemiology, physiology, trials, genetics, etc.). The NSRR is also designed with guiding principles developed by national computational research initiatives, including the Biomedical Information Science and Technology Initiative,4 the NIH Working Group on Data and Informatics for the Advisory Committee to the Director,6 and the White House “Big Data” initiative.5,29,30

NSRR Objectives and Structure

The NSRR is intended to be an expandable resource for sleep researchers and trainees and for those in relevant fields such as engineering and epidemiology. Large numbers of well-annotated, de-identified sleep studies with a variety of linked clinical, physiological, demographic, and biochemical data are being made available to support hypothesis generation, discovery, and replication studies. The NSRR links unique sleep resources to other national repositories such as the Biologic Specimen and Data Repository Information Coordinating Center Demographics (BioLINCC),31 with a goal of also linking to the National Heart, Lung, and Blood Institute (NHLBI) genetics and genomics data repository, dbGaP.32 The NSRR encourages interactions between NSRR users and developers through a variety of channels, such as the online forum and email-based support, so that community needs are assessed on an ongoing basis and responses reflect changing research priorities. The NSRR also links directly to the NIH-sponsored Research Resource for Complex Physiological Signals (PhysioNet)33,34 to leverage PhysioNet’s computational resources for use in analyses of NSRR-deposited datasets.

The NSRR objectives were designed to meet a number of needs of the sleep research community (Table 2). In particular, the challenges of combining data across studies and sources are addressed through the development of a robust semantic infrastructure (including well-defined data dictionaries and provenance documentation) and the provision of clearly visualized and searchable study documentation. Cloud-based computing is used to provide flexible storage and access, and open-source tools, including those for normalizing diverse physiological signals, are used to promote community collaboration. A user-friendly data access process is utilized. Key NSRR goals are as follows.

Goal 1 (Sleep-Arch)

Create an integrated, expandable, metadata-aware, electronic data library of more than 50,000 de-identified, annotated, and normalized polysomnograms (PSGs), including raw signals, scored annotation files, and summary sleep statistics, curated from multiple large-scale research studies of adults and children.

Goal 2 (Sleep-Port)

Provide a web-based portal to access, search, and visualize the data library, using a cloud-based platform to host and provide access to the spectrum of NSRR resources in a secure, scalable, and robust manner.

Goal 3 (Sleep-Terms)

Develop and make publicly available a semantic infrastructure (Sleep-Terms) to standardize terminology across the data library, to facilitate data mapping, harmonization, and enrichment, and to drive querying and data retrieval functionalities.

Goal 4 (Sleep-Tools)

Make available to the research community the suite of data curation, data integration, and signal processing tools used for creating Sleep-Arch and Sleep-Port to facilitate further offline data analysis and discovery of associations among physiological systems and clinical outcomes by investigators.

NSRR goals relative to the national sleep research agenda are shown in Figure 1. Achievement of NSRR goals is overseen by a Steering Committee consisting of sleep and cardiovascular researchers, informaticians, epidemiologists, and computer scientists from three collaborating institutions (Brigham and Women’s Hospital; University of Kentucky; and Beth Israel Deaconess Medical Center), and by an Academic User Group (AUG) composed of key stakeholders and early adopters. Stakeholders include national leaders in sleep epidemiology and representatives from cohorts/trials represented in the repository. Early adopters are individuals, such as trainees, who are interested in using early NSRR releases of datasets and tools in order to provide informed feedback to the NSRR team. The AUG provides input on issues of usability and resource prioritization, as well as on regulatory and ethical issues related to data sharing. Workflow is managed by four NSRR working groups: the Sleep-Arch, Sleep-Port, Sleep-Terms, and Sleep-Tools teams. The Sleep-Arch team is responsible for curating the data from multiple sources. The Sleep-Port team is responsible for developing the NSRR backend system architecture and frontend access functionality. The Sleep-Terms team is responsible for developing a robust framework for clarifying variable meaning (metadata framework) to ensure that all data elements are clearly and consistently defined. The metadata framework is composed of core terms that allow individual cohort terms to be mapped in a semantically consistent manner. Such core terms support NSRR’s frontend functionality for searching data across multiple cohorts, even when such data were collected and defined differently. The Sleep-Tools team is charged with developing tools for accessing, visualizing, and harmonizing study data and for conducting signal analysis. Ad hoc working groups are constructed as necessary to address cross-team issues, community needs, and new requirements as they arise.
A schema of NSRR is shown in Figure 2.

Goal 1 (Sleep-Arch): Data Resource Repository

NSRR aims to flexibly accommodate the deposition of anonymized sleep and cohort/trial-related data from a wide variety of sources and data structures. To immediately populate the repository, the initial sets of research data being deposited are those that have been collected in collaboration with the Brigham and Women’s Hospital Sleep Reading Center. These include reliably scored, well-annotated research sleep studies from 15 major cohorts or clinical trials (~47,000 sleep studies from over 37,000 subjects)9–28 conducted since 1994 (Table 3). These studies all used well-defined methods for data collection and quality assurance, with all scoring performed by the same Sleep Reading Center with established reliability.35 The sleep data include physiological signals such as electroencephalogram (EEG), electrocardiogram (ECG), chest and abdominal wall motion by inductive plethysmography, CO2 waveform, oronasal airflow, pulse oximetry (SpO2), heart rate by ECG, and right/left leg movement. Other data include summary metrics of heart rate, breathing, oxygenation, and brain neurophysiological signals derived from each study, along with available demographic, risk factor, outcome, and biochemical data collected in each study and selected results of quantitative signal analysis. Available outcomes data include measures of clinical events adjudicated by each cohort, including incident cardiovascular disease and mortality. Specialized data available in some cohorts include cardiac MRI, adjudicated clinical events (myocardial infarction, heart failure), incident falls and fractures, incident dementia, incident cognitive impairment, endothelial function, maternal and fetal outcomes, vascular stiffness, and 24-h blood pressure. Each dataset is linked online with the original study documentation, such as protocol descriptions, key references, operation manuals, quality control analyses, and equipment descriptions. Descriptions of each variable are provided. Histograms of selected variables by age, sex, and race are precalculated and rendered online. Links to freely available tools developed to access, review, visualize, and search NSRR datasets are provided, and many are accompanied by tutorials. The NSRR data resource is housed on a secure cloud computing platform that provides application, file, and database hosting services. In the first 14 months of NSRR public access, 14,000 sleep studies from the Sleep Heart Health Study (SHHS), the Childhood Adenotonsillectomy Trial (CHAT), the Heart Biomarkers in Apnea Treatment (HeartBEAT) study, the Cleveland Family Study (CFS), the Study of Osteoporotic Fractures (SOF), and the Osteoporotic Fractures in Men Study (MrOS) were made available. Protocol descriptions, study variable definitions, covariates, and outcome data are provided for each of these studies. The SHHS dataset contains the largest number of sleep studies contributed to the resource, with 6,000 studies available for download. Deposited CHAT data include 453 studies from a baseline visit and 407 studies from a follow-up visit.
The most recently deposited sleep studies are from HeartBEAT (317 from the baseline visit and 274 from the follow-up visit), CFS (731 PSG files), and SOF (578 studies). A rich set of variables extracted from the PSG and collected during each study is available. Between 994 and 1,346 variables are available for each PSG study, depending on the cohort. These variables include technician-scored events and automatically computed variables; the latter are computed with commercial and custom software. Commonly available PSG variable categories deposited for each cohort include administrative notes on study acquisition and scoring procedures, arousals, heart rate, oxygen saturation, respiratory events (apnea and hypopnea), and sleep architecture (technician-scored sleep stages). All available PSG variables are posted. In addition, results of quantitative analyses of the ECG and EEG signals (heart rate variability and EEG power spectrum) are made available as they are generated. Table 4 lists the number of PSG variables by category for each available cohort. A list of the 2,050 PSG variables with descriptions defined for each study is available in the supplemental material. Posted study data include between 904 and 3,025 study variables across a wide range of data categories. Demographics, anthropometry, laboratory results, and medical history are a small sample of the data categories available. Table 5 lists the number of variables available by category for each cohort.
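As a rough illustration of what a spectral analysis of a signal involves, the sketch below computes an amplitude-versus-frequency spectrum with a direct discrete Fourier transform and reports the dominant frequency. It is a generic textbook calculation on a synthetic sine wave, not the NSRR processing pipeline; the function names, the 100 Hz sampling rate, and the 10 Hz test tone are assumptions made for the example.

```python
import cmath
import math


def amplitude_spectrum(samples, sampling_rate_hz):
    """Return (frequencies, amplitudes) for the non-negative frequency bins.

    Uses a direct O(N^2) discrete Fourier transform, which is fine for a
    short illustration but far too slow for a full overnight recording.
    """
    n = len(samples)
    frequencies = []
    amplitudes = []
    for k in range(n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        # Bins other than DC and Nyquist split their energy between +f and -f.
        is_edge_bin = k == 0 or (n % 2 == 0 and k == n // 2)
        scale = 1.0 / n if is_edge_bin else 2.0 / n
        frequencies.append(k * sampling_rate_hz / n)
        amplitudes.append(abs(coefficient) * scale)
    return frequencies, amplitudes


def dominant_frequency(samples, sampling_rate_hz):
    """Frequency (Hz) of the largest-amplitude bin, ignoring the DC bin."""
    frequencies, amplitudes = amplitude_spectrum(samples, sampling_rate_hz)
    peak = max(range(1, len(amplitudes)), key=amplitudes.__getitem__)
    return frequencies[peak]


if __name__ == "__main__":
    rate = 100.0  # Hz, an assumed sampling rate for the synthetic signal
    # One second of a 10 Hz sine wave, standing in for a recorded signal.
    signal = [math.sin(2 * math.pi * 10.0 * t / rate) for t in range(100)]
    print(dominant_frequency(signal, rate))
```

In practice, a fast Fourier transform from a numerical library would replace the direct sum, and long recordings would typically be analyzed in shorter windows.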


Figure 1—Addressing elements of the Sleep Research Strategic Plan with the National Sleep Research Resource (NSRR). The NSRR is designed to facilitate research in line with national research initiatives in the areas of clinical effectiveness, identification of physiological mechanisms associated with healthy and disturbed sleep, stratification of public health outcomes analyses, and creation of new sleep analytics that include information across multiple physiological systems, and to serve as a focus for the training of interdisciplinary researchers. The NSRR is designed to work with other National Institutes of Health research resources such as the Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC) and the Research Resource for Complex Physiological Signals (PhysioNet). The NSRR is planning to support the data and analysis needs of individuals, research groups, and collaborations, as well as to play a role in the training of the next generation of researchers. The NSRR is actively assembling an Academic User Group of innovators and early adopters from the research community to help guide NSRR development and to ensure that NSRR remains relevant and responsive to the research community’s current and future needs.


Figure 2—National Sleep Research Resource (NSRR) overview schematic. The NSRR teams include the Web Access Resource (Port) Team, the Data Structure and Standardization of Terminology (Terms) Team, and the Signal Tools Resource (Tools) Team. Data for each study contain the study protocol, overnight polysomnogram, manually and automatically scored annotations, covariates, and outcomes.

Goal 2 (Sleep-Port): Web Access Resource

Sleep-Port includes web-based tools that support the security needs of the portal (user authentication and credentialing), provide access to an online Data Access and Use Agreement (DAUA) and to the data library (Sleep-Arch), allow the user to download application tool sets (via Sleep-Tools) and tools for querying across datasets (using customized search tools, http://physiomimi.case.edu/),36 and provide access to a User Forum. Users can create an NSRR-specific password or authenticate their identity through a Google account. The DAUA is modeled after the one in use by BioLINCC in order to harmonize requests across different NHLBI resources. Regulatory approvals are managed through an online resource that allows the user to submit an approved Institutional Review Board (IRB)

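Table 3—Description of datasets for initial inclusion in the National Sleep Research Resource.

Cohort/Study: Sleep Heart Health Study (SHHS: subsets of ARIC, CHS, FHS, Tucson)
Number of subjects (number of PSGs), age: 5,600 (8,080), 40+ y
Objective sleep data: Full PSG
Main study outcomes: Incident cardiovascular disease

Cohort/Study: Multiethnic Study of Atherosclerosis (MESA)-Sleep Study
Number of subjects (number of PSGs), age: 2,200, 45 to 84 y
Objective sleep data: Full PSG; actigraphy
Main study outcomes: Incident cardiovascular disease (including cardiac MRI)

Cohort/Study: Study of Osteoporotic Fractures in Older Men-Sleep (MrOS)
Number of subjects (number of PSGs), age: 2,991 (4,452), 65+ y
Objective sleep data: Full PSG; actigraphy
Main study outcomes: Incident falls, fractures, and cardiovascular disease

Cohort/Study: Study of Osteoporotic Fractures in Older Women (SOF)-Sleep
Number of subjects (number of PSGs), age: 460, 75+ y
Objective sleep data: Full PSG; actigraphy
Main study outcomes: Incident dementia, falls, and fractures

Cohort/Study: Honolulu-Asia Aging Study of Sleep Apnea (HAASA)
Number of subjects (number of PSGs), age: 700, 85+ y
Objective sleep data: Full PSG
Main study outcomes: Incident cognitive impairment

Cohort/Study: Starr County Community Study
Number of subjects (number of PSGs), age: 1,200, 35+ y
Objective sleep data: Oximetry; peripheral tonometry; snoring
Main study outcomes: Endothelial dysfunction, diabetes

Cohort/Study: Cleveland Family Study
Number of subjects (number of PSGs), age: 1,600 (3,200), 4 to 96 y
Objective sleep data: Oximetry; thermistry; chest effort; ECG; full PSG in n = 700
Main study outcomes: Genetics of sleep apnea

Cohort/Study: Cleveland Children’s Sleep and Health Study
Number of subjects (number of PSGs), age: 850 (1,603), 8 to 19 y
Objective sleep data: Oximetry; thermistry; NP; RIP; ECG in all; full PSG and actigraphy in n = 504
Main study outcomes: Incident obesity and pediatric sleep disorders

Cohort/Study: Childhood Adenotonsillectomy Trial (CHAT)
Number of subjects (number of PSGs), age: 1,900 (2,300), 5 to 10 y
Objective sleep data: Full PSG
Main study outcomes: Sleep apnea treatment effects on cognition, behavior, and growth

Cohort/Study: Hispanic Community Health Study (HCHS/SOL)
Number of subjects (number of PSGs), age: 15,000, 18 to 74 y
Objective sleep data: Oximetry; NP; snoring; movement; actigraphy in n = 2,000
Main study outcomes: Diabetes, cardiovascular disease, neurocognition, hearing loss

Cohort/Study: Nulliparous Outcomes (nuMOM2b)
Number of subjects (number of PSGs), age: 3,700 (7,000), 13 to 45 y
Objective sleep data: Oximetry; NP; RIP; ECG
Main study outcomes: Maternal and fetal outcomes

Cohort/Study: Heart Biomarkers In Apnea Treatment Study (HeartBEAT)
Number of subjects (number of PSGs), age: 305 (580), 45 to 75 y
Objective sleep data: Oximetry; NP; RIP; ECG
Main study outcomes: Sleep apnea treatment effects on 24-h blood pressure and biomarkers

Cohort/Study: Best Apnea Interventions In Research (BestAIR)
Number of subjects (number of PSGs), age: 180, 45 to 75 y
Objective sleep data: Oximetry; NP; thermistry; RIP; EEG; EOG; EMG; ECG; leg movements
Main study outcomes: Sleep apnea treatment effects on 24-h blood pressure, vascular stiffness, and biomarkers

Cohort/Study: Apnea, Bariatric Surgery and Sleep Trial (ABC)
Number of subjects (number of PSGs), age: 80 (160), 18 to 65 y
Objective sleep data: Full PSG
Main study outcomes: Treatment effects on 24-h blood pressure, vascular stiffness, biomarkers

ECG, electrocardiogram; EEG, electroencephalogram; EMG, electromyography; EOG, electro-oculogram; NP, nasal pressure; PSG, polysomnogram; RIP, respiratory inductance plethysmography.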

application from the user’s home institution or to complete an online IRB application that is reviewed on an ongoing basis by an internal NSRR review committee. An integrated set of query and search tools is available to assist with report generation, data visualization, exploration, and retrieval of raw physiologic signals and summary statistics. The User Forum is intended as the initial mechanism for users to interact with the NSRR team and to support community-based discussion of emerging issues related to sleep signal analysis, sleep data standardization, and related topics. Users are encouraged to post questions to the forum, which are generally answered by NSRR team members within one day of posting. The primary web user interface is built as a Ruby on Rails application that integrates Turbolinks technology (for optimizing performance) and custom-designed components.

The NSRR was opened in April 2014. Over its first 14 months of operation, the NSRR website registered 784 users, and 11.4 terabytes of data representing 708,570 files were downloaded. Users come from 45 countries (40% from the US). Thirty-eight user applications were reviewed in the second quarter of 2015, and 121 data access agreements have been approved to date. Interactions with the user community have taken place through email and within the online Forum. Email interactions have been useful for identifying areas for improvement. Email questions and requests are sent to the appropriate NSRR team members, discussed at team meetings, and reviewed at monthly NSRR meetings when necessary. The major discussion threads concern accessing signal data, accessing annotations, and performing spectral analyses.

Goal 3 (Sleep-Terms): Standardized Terminology and Data Structure Resource

A canonical data dictionary that maps study-specific data terms to a standardized set of definitions has been developed as a means of standardizing NSRR terminology across studies. The Sleep-Terms team works to regularly incorporate an expanding set of core terms drawn from a range of domains, prioritized on the basis of their anticipated utility for facilitating within-cohort exploratory data analysis and cross-cohort query generation.
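The mapping of study-specific terms to a canonical data dictionary can be sketched in a few lines of Python. This is a minimal illustration of the idea only, not the NSRR implementation: the study labels, variable names (such as AGE_YRS and ahi_all), canonical terms, and values are all hypothetical and are not taken from any NSRR dataset.

```python
import json

# Hypothetical canonical dictionary: core terms with descriptions and units.
CANONICAL_TERMS = {
    "age_years": {"description": "Age at the time of the sleep study", "units": "years"},
    "ahi": {"description": "Apnea-hypopnea index", "units": "events/hour"},
}

# Hypothetical study-specific variable names mapped to canonical terms.
STUDY_MAPPINGS = {
    "study_a": {"AGE_YRS": "age_years", "AHI_TOTAL": "ahi"},
    "study_b": {"age_at_psg": "age_years", "ahi_all": "ahi"},
}


def harmonize(study, record):
    """Rename a record's variables to canonical terms.

    Returns the harmonized record plus the names that have no mapping yet,
    so unmapped variables are reported rather than silently dropped.
    """
    mapping = STUDY_MAPPINGS[study]
    harmonized = {"study": study}
    unmapped = []
    for name, value in record.items():
        term = mapping.get(name)
        if term is None:
            unmapped.append(name)
        else:
            harmonized[term] = value
    return harmonized, unmapped


def query(records, term, minimum):
    """Select pooled records whose canonical `term` is at least `minimum`."""
    return [r for r in records if term in r and r[term] >= minimum]


if __name__ == "__main__":
    pooled = []
    for study, record in [
        ("study_a", {"AGE_YRS": 61, "AHI_TOTAL": 22.4, "SITE": 3}),
        ("study_b", {"age_at_psg": 9, "ahi_all": 1.8}),
    ]:
        harmonized, unmapped = harmonize(study, record)
        pooled.append(harmonized)
    # A cross-cohort query expressed once, in canonical terms.
    print(json.dumps(query(pooled, "ahi", 15.0), indent=2))
    print(json.dumps(CANONICAL_TERMS["ahi"]))
```

Because every cohort is renamed to the same core terms, a single query can run over the pooled records even though each study originally used its own variable names.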


Table 4. Number of polysomnography variables available by category for each cohort. Subtotal rows give the group counts reported in parentheses in the original layout; a hyphen marks a category with no entry listed for that cohort.

Variable type | CFS | CHAT | MrOs | SHHS | SOF
Administrative | 3 | 30 | 3 | 4 | 11
Arousals | 11 | 15 | 11 | 11 | 11
Capnography | - | 27 | - | - | -
Events | 87 | - | 87 | - | -
Heart rate | 54 | 57 | 54 | 48 | 52
Medical alert | 5 | - | 5 | 5 | 5
Oxygen saturation | 173 | 206 | 173 | 136 | 142
Respiratory events, subtotal | 884 | 905 | 884 | 637 | 649
  Apnea, subtotal | 629 | 620 | 629 | 400 | 414
    Apnea | 23 | 6 | 23 | - | -
    Central | 200 | 207 | 200 | 200 | 207
    Mixed | - | 201 | - | - | -
    Obstructive | 206 | 206 | 206 | 200 | 206
    Unsure | 200 | - | 200 | - | -
  Hypopnea | 208 | - | 208 | 200 | 206
  Indexes | 47 | - | 47 | 37 | 16
Signal quality | 65 | 68 | 65 | 83 | 61
Sleep architecture | 64 | 55 | 64 | 70 | 57
Sleep events, subtotal | - | 109 | - | - | 50
  Sleep events | - | 19 | - | - | 3
  Limb movements | - | 90 | - | - | 47
Totals | 1,346 | 1,023 | 1,350 | 994 | 1,024

CFS, Cleveland Family Study; CHAT, Childhood Adenotonsillectomy Study; HeartBeat, Heart Biomarkers In Apnea Treatment Study; MrOs, Study of Osteoporotic Fractures in Older Men-Sleep; SHHS, Sleep Heart Health Study; SOF, Study of Osteoporotic Fractures in Older Women.

Table 5. Number of variables available by category for each cohort (CFS, CHAT, HeartBeat, MrOs, SHHS, and SOF). Variable types listed across the cohorts include administrative, ambulatory blood pressure, ankle-arm index, anthropometry, biochemical results, biochemical testing, blood pressure, brachial reactivity, CVD outcome, Cyprus body fat, demographics, DNA, electrocardiogram, electrocardiography, Embletta, EndoPAT, endpoints, family history, heart rate, interim, lab visit, lifestyle, measurement, medical history, medications, neurocognitive testing, nitric oxide analysis, nurse’s notes, oxygen concentration, PAP therapy, pharyngometry, polysomnography, quality of life, questionnaires, rhinometry, seated blood pressure, skinfold measurement, sleep, sleep diary, spectral analysis, and surgical information.

CFS, Cleveland Family Study; CHAT, Childhood Adenotonsillectomy Study; CVD, cardiovascular disease; HeartBeat, Heart Biomarkers In Apnea Treatment Study; MrOs, Study of Osteoporotic Fractures in Older Men-Sleep; SHHS, Sleep Heart Health Study; SOF, Study of Osteoporotic Fractures in Older Women.

The canonical data dictionary extends existing data definition files by linking data definition terms with standardized terms. Core terms that specify demographic information, anthropometric parameters, physiologic measurements, medical history elements, sleep study data, and neurocognitive testing results are linked to coded terms listed in established biomedical ontologies including the Systematized Nomenclature of Medicine-Clinical Terms, the FDA Drug Classification system, the Ontology for Biomedical Investigations, and the International Classification of Sleep Disorders. As each new dataset is added to the NSRR, an external data file is generated that maps core terms listed in the canonical data dictionary to corresponding study variables. As each study variable is mapped, it is annotated with provenance attributes that specify the source of the data, the time point of collection, the method of collection, and the specifics of any equipment used to acquire the data. This approach was adopted after a summary investigation of existing ontologies revealed inadequate representation of sleep medicine terms and concepts. In an effort to address this deficiency, a Sleep Research Ontology is being developed using the standard Web Ontology Language and will be uploaded to the National Center for Biomedical Ontology (BioPortal).

As a means to accurately and robustly document thousands of variables across multiple datasets, a tool called Spout has been developed and leveraged in the deployment of new data to the NSRR. Spout is a command-line tool developed using Ruby on Rails programming that allows the user to manage a data dictionary (i.e., variable labels, definitions, calculations, relationships) in a version-controlled and collaborative environment (Git and GitHub.com). Spout also aids in checking for data outliers, ensuring that all variables are covered by JSON definition files, and generating graphs and summary statistics that are made publicly available for users browsing the NSRR.

In the first 14 months of public access, the Sleep-Terms team has mapped cohort-specific data definition terms to core terms to facilitate multi-cohort queries. The team mapped 2,746 variables across the SHHS, CHAT, HeartBeat, CFS, SOF, and MrOS datasets. Each of the mapped terms can be used to create data queries or to search online study documentation. The NSRR canonical data dictionary comprises 328 core terms spanning data categories such as clinical drug class, clinical drug component, recording instrument, study information, electrophysiological data, atrial fibrillation status, cardiovascular disease status, physiological data, and recorded anthropometry.

Goal 4 (Sleep Tools): Signal Tools Resource

Two sets of tools are made available by the Sleep-Tools Resource. The first supports data queries through an online query explorer. The query explorer enables cross-cohort queries that allow the user to construct a query from over 328 core terms.
The online tool includes a query construction interface, a dictionary of core terms, a description of queryable datasets, and an online tutorial. Searching across datasets provides a unique way to query sleep data across cohorts. An illustration of the cross-cohort query interface is shown in Figure 3.

Given the growing interest in quantitative analysis of complex signals, the NSRR also provides the user community a growing suite of open-source data visualization and signal processing tools to expedite research and encourage documentation and sharing of algorithms, including information about the details of the pre-processing (e.g., band-pass filters) and processing methods. To foster a ‘best-practices’ approach to the analysis of complex signals, the NSRR partners with the PhysioNet Resource (http://www.physionet.org/). This synergy between two NIH-sponsored resources helps disseminate open-source computational tools for the analysis of multimodal signals recorded during sleep studies. The transparency provided by combining open-source datasets with open-source software should help resolve long-standing controversies and promote translational research.33,34,37,38

The European Data Format (EDF) has been selected as the NSRR format for making PSG signal data available39 because options for exporting sleep studies as an EDF file are available from most commercial sleep systems. Challenges associated with EDF files include extensions to the specification that have not been uniformly adopted40,41 and the development of undocumented commercial EDF variations. The NSRR developed and provides a series of EDF tools (EDF Viewer, Editor, and Annotation Translator) to help the user normalize the signals and study attributes included in the EDF files generated from PSGs, even when PSGs were collected using different vendors and signal montages, and to scrub all identifying information from the EDF headers. EDF Editor provides the user templates to facilitate EDF header edits of multiple files simultaneously (batch processing). EDF Annotation Translator allows vendor-specific annotation formats to be translated to a standard XML schema, so that scoring annotations for PSGs collected using disparate systems can be shared among investigators. Such tools can markedly facilitate multicenter collaboration by allowing more efficient processing, scoring, and archiving of data from multiple laboratories.

The Sleep-Tools team is developing computational algorithms for visualizing and analyzing physiological signals stored in sleep study files. Tools for editing artifact and batch processing of heart rate variability42 are available through PhysioNet. An EEG spectral analysis with artifact detection program43,44 is available through the NSRR. In addition, PhysioNet’s extensive signal analysis resources include algorithms to compute ECG-derived respiration45 and to measure nonlinear/complexity properties of heart rate dynamics (including Lyapunov exponents,46 multiscale entropy,47 and fractal scaling48).
Furthermore, ongoing research is directed at quantifying properties of complex signals related to time irreversibility52 and physiologic coupling.49,50 In the first 14 months of public access, an integrated EDF Editor, translator, and viewer tool has been adapted from Physio-MIMI that allows an individual to review the consistency and integrity of sleep files (EDF) and annotation files (XML), correct any inconsistencies in the EDF files or de-identified EDF headers, and translate annotations from commercial formats to a common format. The compiled application, source code, and tutorials can be accessed from the NSRR Web Portal. The EDF Viewer can be used independently of the Editor and translation tools to view NSRR sleep (EDF) and annotation (XML) files. A compiled application for Microsoft Windows platforms and MATLAB source code is available from the NSRR website (https://sleepdata.org/tools/edf-viewer). Online visualization of study signals is supported by another newly developed tool, Altamira, which utilizes HTML5 and JavaScript to allow users the opportunity to preview EDF signals in the web browser before downloading these large files. Screen shots of the EDF Editor, translator, viewer tool, and Altamira are shown in Figures S1 and S2 in the supplemental material. An automatic EEG artifact detection and spectral analysis program has been built as part of the Data Access and Visualization for Sleep toolbox. To date, 17,718 sleep studies from six NIH studies (SHHS,9 CHAT,22,23 CFS,18 MESA,10 MrOS,51 and SOF12,13) have been processed with automated artifact detection and spectral analysis software resulting in spectral data for 27,991 EEG signals. Automatically generated spectral output is adjudicated by an experienced technician and problematic studies are reviewed by an international team of experts. Artifact detection and spectral analysis source code, a brief user guide, and a getting started guide can be found online.
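As an illustration of the kind of header review and de-identification that the EDF tools support, the following is a minimal Python sketch and not the NSRR implementation. It assumes the 256-byte fixed header layout of the published EDF specification; the function names, the demo header values, and the policy of overwriting the patient and recording fields with a placeholder are hypothetical choices made for this example.

```python
# Minimal sketch: parse and scrub the fixed 256-byte EDF header.
# Field names and widths follow the published EDF specification;
# everything else (function names, placeholder policy) is hypothetical.

EDF_FIXED_FIELDS = [
    ("version", 8),
    ("patient_id", 80),
    ("recording_id", 80),
    ("start_date", 8),
    ("start_time", 8),
    ("header_bytes", 8),
    ("reserved", 44),
    ("num_records", 8),
    ("record_duration", 8),
    ("num_signals", 4),
]
FIXED_HEADER_SIZE = sum(width for _, width in EDF_FIXED_FIELDS)  # 256


def parse_fixed_header(raw: bytes) -> dict:
    """Split the fixed header into named, whitespace-stripped ASCII fields."""
    if len(raw) < FIXED_HEADER_SIZE:
        raise ValueError("EDF fixed header must be at least 256 bytes")
    fields, offset = {}, 0
    for name, width in EDF_FIXED_FIELDS:
        fields[name] = raw[offset:offset + width].decode("ascii").strip()
        offset += width
    return fields


def scrub_fixed_header(raw: bytes, placeholder: str = "X") -> bytes:
    """Return a copy with the patient and recording fields overwritten."""
    if len(raw) < FIXED_HEADER_SIZE:
        raise ValueError("EDF fixed header must be at least 256 bytes")
    cleaned, offset = bytearray(raw), 0
    for name, width in EDF_FIXED_FIELDS:
        if name in ("patient_id", "recording_id"):
            # Same-width replacement keeps every later field at its offset.
            cleaned[offset:offset + width] = placeholder.ljust(width).encode("ascii")
        offset += width
    return bytes(cleaned)


def make_demo_header() -> bytes:
    """Build a synthetic two-signal header (invented values) for testing."""
    values = {
        "version": "0",
        "patient_id": "DOE JOHN 1950",
        "recording_id": "Lab 7 tech AB",
        "start_date": "02.03.14",
        "start_time": "22.15.00",
        "header_bytes": str(256 + 2 * 256),
        "reserved": "",
        "num_records": "100",
        "record_duration": "30",
        "num_signals": "2",
    }
    return "".join(values[n].ljust(w) for n, w in EDF_FIXED_FIELDS).encode("ascii")


if __name__ == "__main__":
    scrubbed = scrub_fixed_header(make_demo_header())
    print(parse_fixed_header(scrubbed))
```

Because every EDF header field has a fixed width, identifying text can be overwritten in place without shifting the signal data that follow, which is what makes batch editing of many files practical.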


Figure 3—Query Explorer allows users to query across study datasets. The Query Explorer is an interface that allows users to query across multiple datasets. The query is built from terms common to the selected databases. Defining a query across datasets requires the users to complete three steps: 1. Select datasets for which to query. 2. Select core terms that are defined across the selected datasets to build the query. 3. Specify ranges for each selected term. Selecting the ‘Query’ button initiates a search across the selected datasets and produces a summary. The numbers of subjects by study that meet the selection criteria are shown in the lower right hand corner.
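The three query steps described for Figure 3 can be sketched in a few lines of Python. This is an illustrative sketch only and does not reproduce the Query Explorer's implementation: the dataset names, variable names, records, and the `CORE_TERM_MAP` structure are all invented for the example. It assumes that each core term is mapped to a dataset-specific variable name and that a subject matches when every selected term falls within its range.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mapping: core term -> {dataset -> study-specific variable name}.
CORE_TERM_MAP: Dict[str, Dict[str, str]] = {
    "age": {"cohort_a": "age_years", "cohort_b": "AGE"},
    "ahi": {"cohort_a": "ahi_events_per_hr", "cohort_b": "AHI"},
    "bmi": {"cohort_a": "bmi_kg_m2"},  # defined in only one dataset
}

# Invented records, one dict per subject.
DEMO_DATA: Dict[str, List[Dict[str, Optional[float]]]] = {
    "cohort_a": [
        {"age_years": 45, "ahi_events_per_hr": 12.0, "bmi_kg_m2": 27.1},
        {"age_years": 70, "ahi_events_per_hr": 3.0, "bmi_kg_m2": 31.4},
        {"age_years": 52, "ahi_events_per_hr": None, "bmi_kg_m2": 24.0},
    ],
    "cohort_b": [
        {"AGE": 8, "AHI": 6.5},
        {"AGE": 50, "AHI": 20.0},
    ],
}


def shared_terms(datasets: List[str]) -> List[str]:
    """Step 2: core terms that are defined in every selected dataset."""
    return sorted(
        term for term, variables in CORE_TERM_MAP.items()
        if all(ds in variables for ds in datasets)
    )


def run_query(
    data: Dict[str, List[Dict[str, Optional[float]]]],
    datasets: List[str],
    ranges: Dict[str, Tuple[float, float]],
) -> Dict[str, int]:
    """Step 3: count subjects per dataset whose values fall in every range."""
    allowed = set(shared_terms(datasets))
    unknown = set(ranges) - allowed
    if unknown:
        raise KeyError(f"terms not shared by all datasets: {sorted(unknown)}")
    counts = {}
    for ds in datasets:
        matched = 0
        for row in data[ds]:
            values = [row.get(CORE_TERM_MAP[term][ds]) for term in ranges]
            bounds = list(ranges.values())
            # Missing values never match; bounds are inclusive.
            if all(v is not None and lo <= v <= hi
                   for v, (lo, hi) in zip(values, bounds)):
                matched += 1
        counts[ds] = matched
    return counts


if __name__ == "__main__":
    selected = ["cohort_a", "cohort_b"]          # Step 1: select datasets
    print(shared_terms(selected))
    print(run_query(DEMO_DATA, selected, {"age": (40, 60), "ahi": (5, 30)}))
```

Restricting the query to terms shared by all selected datasets mirrors the caption's statement that the query is built from terms common to the selected databases; treating missing values as non-matching is a design choice of this sketch.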

CHALLENGES AND FUTURE DIRECTIONS

The new research landscape emphasizes leveraging complex data in new ways, allowing a shift from predicting group averages to predicting individual outcomes, and creating resources that allow more efficient analysis of outcomes in studies sufficiently powered to provide definitive answers. In this regard, NSRR aims to integrate sleep research community research priorities with “big data” approaches. The NSRR provides user-friendly access to comprehensive physiological signal data and linked clinical data from a large number of studies, including information on disease risk factors and outcomes, cardiovascular and neurocognitive function, and biochemical marker data for children and adults representing diverse backgrounds. Open access to data and tools is designed to support reproducible research initiatives52 and to help individuals and communities engage in sleep research, including initiatives identified by the joint task force of the Sleep Research Society and the American Academy of Sleep Medicine.2

The creation of a central repository of well-defined sleep studies, including physiological signal data, summary sleep data, and annotations, along with key covariate data, could transform data-sharing approaches across the scientific community. The NSRR aims to enhance the ability of researchers to perform queries that require larger sample sizes and more diverse populations than any one cohort would provide. It also aims to provide tools that would eliminate the need for tedious, manual exploration and recoding of variable names across datasets. Resources that lower the barriers to accessing and analyzing existing sleep data from multiple NIH cohorts are intended to accelerate sleep research, which has lagged behind other areas because relatively less sleep data are available than data collected on other heart, lung, and brain phenotypes. Providing community-wide access to reliably scored sleep data, also accessible as raw physiological signals and quantitatively processed summary data with links to demographic and health data, also aims to accelerate research designed to address the role of sleep and sleep disorders in the pathogenesis of chronic illness, while reducing the cost of such research. The availability of physiological signals contained within PSG studies allows for the development of new analytical/modeling methods that can be validated in an unprecedented number of subjects, including validation within subgroups (age, race, sex, and condition).

Developing and maintaining a robust and dynamic resource of the size and scope of NSRR, however, pose significant challenges, and during the initial development and rollout of the NSRR, we faced a number of such challenges. We created an AUG with the explicit goal of ensuring that the community’s needs are the driving force for the resource. However, we recognized that the AUG members were quite varied in their interests and that there was not a sufficient number of individuals poised to “test drive” the informatics platform. We therefore purposively invited additional individuals with specific analytic and signal processing interests, including trainees and junior colleagues of our senior advisors, to constitute an “early adopters” group. To incentivize and facilitate interactions, we provide additional biostatistical support for given projects. We now plan to further expand interaction and feedback through the use of online “challenges” and through periodic summits. A User Forum further encourages interactions with a wider group of individuals.

A second challenge is the need to achieve an appropriate balance between data access and confidentiality. The era of “Big Data” occurs at a time when tensions between data confidentiality/privacy and broad data sharing requirements have not been completely resolved. We attempted to address these issues by working closely with our institution’s IRB and leaders from each data source, and by following general procedures established by NHLBI for data sharing.
All shared NSRR data have been deidentified or fully anonymized and no patient health information is accessible. To mirror other NHLBI approaches, we developed a DAUA consistent with that used by NHLBI’s BioLINCC resource that could easily be submitted using interactive online tools. Part of the initial DAUA included submission of IRB approval from the user’s institution. We found that requiring each user to obtain IRB approval from their home institution was an impediment for some students and international researchers. We therefore obtained approval from the Partners HealthCare System’s IRB to constitute a local review board and to provide online IRB application forms that efficiently collect the key information needed to address issues related to data integrity and security, thus facilitating review and approval. The online regulatory approval process now streamlines both IRB and DAUA review and approval. Since implementing this, the typical number of data requests per month has more than tripled. As regulatory requirements evolve, these procedures will require updating.

The scientific integrity of shared data presents concerns to those initially responsible for collecting and compiling the data, those sharing the data, and those using the data. “Perfect is the enemy of the good” may be a relevant maxim to consider as efforts are implemented to check and share data. Enormous datasets from multiple sources are likely to contain some, although hopefully rare, errors, and no amount of data checking is likely to completely sanitize very large datasets. Also, data that have been processed, whether hand-scored sleep data or processed heart rate variability data, yield values that reflect specific procedures that may or may not be optimal for all purposes. After attempting to implement appropriate “best practices” for data management, processing, and checking, the NSRR team decided that such data can be shared with appropriate documentation on how the data were analyzed, while encouraging users to take individual responsibility for ensuring that the data are used appropriately for given research questions. Areas of controversy, such as how best to remove artifact or average signals, are identified, with the hope that this will encourage further research on identifying best methods.

Although data harmonization or standardization is essential for aggregating data that are derived from different sources, this process also could diminish the integrity of the original data if disparate data fields are combined inappropriately. The NSRR responded to this challenge by maintaining all original data in its native format and then mapping each element to core terms. Quality control of this process is partly handled during a data upload procedure that identifies differences in how a variable is encoded. The user also can utilize a graphical interface to view each term, its derivation (provenance), and its numeric distribution specific to each dataset, providing user control over selecting terms to harmonize across data sources. Tools were developed to help standardize the display of PSG data.

To initially populate NSRR, we selected well-defined, large research databases. However, the NSRR infrastructure is designed to scale as both the number of users and the size of the dataset downloads increase. Core web functionality is designed in a modular fashion so that services (such as cross-cohort queries and online signal visualization) can be moved to alternate servers as required.
The use of a virtualized environment will enable migration of the NSRR resource to larger cloud solutions as necessary. Posted NSRR datasets are constructed to overcome known “Big Data” analysis challenges resulting from a fixed set of measures obtained in a fixed automated way.53,54 The measures extracted from PSG signals for each cohort allow the Sleep-Terms team to map terms from NIH cohort studies to a common set of terms. In cases of missing information, the NSRR team works closely with vendors to define variables generated by commercial software, works with cohort leadership to clarify variable definitions, and has contacted data collection study staff to clarify data provenance information. The NSRR recognizes that the current set of variables extracted from PSG studies is only a starting point. Several NSRR initiatives are under way to stimulate development of new quantitatively derived indices that can be extracted from PSG signal data across cohorts. For example, the NSRR has collaborated with a Workshop of the American Thoracic Society to sponsor a Flow Limitation Challenge, a crowdsourcing challenge that aims to develop a consensus algorithm among sleep professionals and signal processing experts for characterizing inspiratory flow limitation or flattening from data recorded on a nasal pressure transducer.

The NSRR also has stimulated the development of a Center for Complex Sleep Signal Analysis (CCSSA) to catalyze the development, evaluation, and sharing of sleep signal analysis methods. The purpose of the CCSSA is to provide a mechanism for enhancing outreach to the community. Among the initial foci are: (1) providing consultative assistance to trainees and investigators in analyzing multimodal PSG time series, which are typically nonstationary and nonlinear; (2) “assaying” priorities for open source signal tool development and refinement; and (3) defining featured areas of interest and controversy that can be used to create “Signal Analysis Challenges” based on real-world data. In a complementary way, the NSRR will function as a platform for disseminating the Center’s complex signals consultative services, catalyzing cooperative work among investigators with common interests, and posting computational tools developed in response to community needs. Creation of the CCSSA formalizes NSRR collaborations with PhysioNet that have resulted in the initial release of signal processing tools.

Success can only be achieved with active community input and ongoing iteration. An NSRR user forum is available to discuss analysis methodology and to begin to develop consensus on sleep research analysis. We envision a key target audience to be trainees, who may especially benefit from ready access to well-documented data sources, and will seek to fully engage them and leverage their insights from active use of data and tools. Exemplars of reproducible research that include methods and analysis source code will be posted. The breadth of NSRR data on the website can support individual trainee fellowship applications, class projects, short-term undergraduate research experiences, master’s projects, and doctoral theses, as well as grant proposals.

The scope of the NSRR requires close project management and coordination to ensure that specific milestones are met. Through an agreement with the funder, NHLBI, we developed 164 quantitative milestones across the four specific aims of the 5-year initial project. This milestone document serves as the framework for reviewing quantitative metrics describing project management and effect. Regular reviews of quantitative and qualitative metrics are needed to allow the team to make ongoing adjustments to ensure that NSRR meets community needs and responds to technology or regulatory issues.

Table 6. List of NSRR-generated resources with internet addresses.

NSRR Quick Links
  Home Page: https://sleepdata.org
  User Forum: https://sleepdata.org/forum
  Cross Dataset Query: http://searchsleepdata.case.edu/tutorial
  Multi-modality Multi-Resource Data Integration Environment for Physiological and Clinical Research (PhysioMiMi): http://physiomimi.case.edu/physiomimi/index.php/Main_Page

NSRR Partners
  Research Resource of Complex Physiological Signals (PhysioNet): http://www.physionet.org/
  The Database for Genotypes and Phenotypes (dbGaP): http://www.ncbi.nlm.nih.gov/gap
  Biologic Specimen and Data Repository (BioLINCC): https://biolincc.nhlbi.nih.gov/home/

Data Access and Analysis Tools
  EDF Editor and Translator: https://sleepdata.org/tools/edf-editor-and-translator
  Data Access and Visualization for Sleep Toolbox: https://github.com/DennisDean/DAVS-Toolbox/blob/master/README.md
  Spectral Train Fig (an open source EEG spectral analysis pipeline): https://github.com/DennisDean/SpectralTrainFig/blob/master/README.md
  Block EDF Viewer (an open source EDF viewer): https://sleepdata.org/tools/edf-viewer
  Block EDF Signal Raster View (an open source tool for checking and viewing sleep study contents): http://www.mathworks.com/matlabcentral/fileexchange/46366blockedfsignalrasterview
  Block EDF Load (a routine for accessing information stored within a sleep study): http://www.mathworks.com/matlabcentral/fileexchange/42784-blockedfload
  Tools Hosted on MATLAB Site: http://www.mathworks.com/matlabcentral/fileexchange/?term=authorid:113409

Informatics Links
  Ruby on Rails (the web application framework used to develop sleepdata.org): http://rubyonrails.org
  Turbolinks (a Ruby application that speeds up access to web pages created with Ruby on Rails): https://github.com/rails/turbolinks
  Spout (a Ruby application for maintaining and checking a data definition file): https://github.com/sleepepi/spout
  Altamira (a Ruby application for displaying signals stored within an EDF file within an internet browser): https://github.com/nsrr/altamira

The European Data Format (EDF) is a file format for storing physiological signals collected during a sleep study.

Open Invitation to the Community

The NSRR is intended to be a living resource that will adapt to the needs of individual researchers, research teams, and educators. Individuals are encouraged to propose projects that require NSRR data and methods. NSRR data are particularly well suited to enable systems analysis approaches using signal data stored in a PSG. NSRR researchers are encouraged to contribute their methods to the growing pool of tools and quantitative methods. The NSRR team will facilitate community vetting of methods and algorithms, develop guidelines for applying quantitative methods, and propose challenge problems that foster development of the new quantitative methods required to automatically analyze large numbers of sleep studies. Community summits will be convened to showcase best practices, to introduce new methods, to highlight key findings, to discuss challenges and policies associated with data sharing,55,56 and to serve as a venue for the community to guide NSRR direction and growth.

In summary, the NSRR is a community-driven resource intended to provide a cost-effective approach to large-scale sleep-related research designed to address some of the opportunities and challenges identified in the current “big data” era.57 Advances in information technology allow large disparate data sources to be combined and accessed by broad audiences. In parallel, there are growing scientific imperatives to analyze larger datasets that have the power and dimensionality to address emerging scientific questions that cannot typically be generated by single users. The NSRR provides a set of tools and a framework for bringing the sleep community together to begin to generate the data stores needed to address questions on disease pathogenesis and outcomes, and to develop novel disease signatures (see Table 6 for a summary of resources).
As currently composed, it is well suited to addressing questions regarding associations between sleep exposures and cardiovascular outcomes, discovering novel signatures from the PSG, and exploring age- and ethnicity-specific variation in sleep traits. There are thousands of other research and clinical datasets in the community that could enhance collaborative sleep research but are not currently sufficiently structured or supported to be broadly accessible for data aggregation efforts. Expanding such efforts to include new data sources, extending data types to include circadian and actigraphy terms and data, and linking to biorepositories and genomic datasets (such as dbGaP) are only a few of the next steps needed to continue to move the sleep medicine field forward. Further integrating such efforts into the formal BD2K initiative and other big data projects, such as the NSF-NIH Interagency Initiative: Core Techniques and Technologies for Advancing Big Data Science and Engineering (BIGDATA) and DDDAS: Dynamic Data Driven Applications Systems, also may help solidify the sleep medicine field’s position in the rapidly evolving discipline of data science.

REFERENCES

1. National Institutes of Health. National Institutes of Health Sleep Disorders Research Plan, 2011. https://www.nhlbi.nih.gov/health-pro/resources/sleep/nih-sleep-disorders-research-plan-2011.

2. Zee PC, Badr S, Kushida C, et al. Strategic opportunities in sleep and circadian research report of the joint task force of the Sleep Research Society and American Academy of Sleep Medicine. Sleep 2014;37:219–27. 3. Strollo PJ. Embracing change, responding to challenge, and looking toward the future. J Clin Sleep Med 2010;6:312–3. 4. Miller K. Bringing the fruits of computation to bear on human health: it’s a tough job but the NIH has to do it. Biomed Comput Rev 2009:18–28. 5. Office of Science and Technology Policy. Obama administration unveils “big data” initiative: announces $200 million in new R&D investments. Accessed 5/15/2014. Available from: http://www. whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_ release.pdf. 6. DeMets D, Tabak L, Altman R, et al. Data informatics working group: draft report to the advisory committee to the director. Bethesda, MD: National Institutes of Health, June 15, 2012. 7. Margolis R, Derr L, Dunn M, et al. The National Institutes of Health’s Big Data to Knowledge (BD2K) initiative: capitalizing on biomedical big data. J Am Med Inform Assoc 2014;21:957–8. 8. Ohno-Machado L. NIH’s Big Data to Knowledge initiative and the advancement of biomedical informatics. J Am Med Inform Assoc 2014;21:193. 9. Quan SF, Howard BT, Iber C, et al. The Sleep Heart Health Study: design, rationale, and methods. Sleep 1997;20:1077–85. 10. Bild DE, Bluemke DA, Burke GL, et al. Multi-Ethnic Study of Atherosclerosis: objectives and design. Am J Epidemiol 2002;156:871–81. 11. Dam T-TL, Ewing S, Ancoli-Israel S, et al. Association between sleep and physical funciton in older men: the MrOs Sleep Study. J Am Geriatr Soc 2008;56:1665–73. 12. Cummings SR, Nevitt MC, Browner WS, et al. Risk factors for hip fracture in white women. N Engl J Med 1995;333:767–73. 13. Claman DM, Redline S, Blackwell T, et al. Prevalence and correlates of periodic limb movements in older women. J Clin Sleep Med 2006;2:438–45. 14. 
Babar SI, Enright PL, Boyle P, et al. Sleep distrurbances and their corrleates in elderly Japanese American men residing in Hawaii. J Gerontol A Biol Sci Med Sci 2000;55A:M406–M411. 15. Foley DJ, Masaki K, White L, Larkin EK, Monjan A, Redline S. Sleep-disordered breathing and cognitive impariment in elderly Japanese-American men. Sleep 2003;26:596–9. 16. Marmot MG, Syme SL, Kagan A, Kato H, Cohen JB, Belsky J. Epidemiologic studies of coronary heart disease and stroke in Japanese men living in Japan, Hawaii and California: prevalence of coronary and hypertensive heart disease and associated risk factors. Am J Epidemiol 1975;102:514–25. 17. Hanis CL, Ferrell RE, Barton SA, et al. Diabetes among Mexican Americans in Starr County, Texas. Am J Epidemiol 1983;118:659–72. 18. Redline S, Tishler PV, Tosteson TD, et al. The familial aggregation of obstructive sleep apnea. Am J Respir Crit Care Med 1995;151:682–7. 19. Rosen CL, Larkin EK, Kircherner HL, et al. Prevalence and risk factors for sleep-disordered breathing in 8- to 11-year-old children: association with race and prematurity. J Pediatr 2003;142:383–9. 20. Johnson NL, Kirchner HL, Rosen CL, et al. Sleep estimation using writst actigraphy in adolescents with and without sleep disordered breathing: a comparison of three data models. Sleep 2007;30:899–905. 21. Javaheri S, Storfer-Isser A, Rosen CL, Redline S. Sleep quality and elevated blood pressure in adolescents. Circulation 2008;118:1034–40. 22. Redline S, Amin R, Beebe D, et al. The Childhood Adnotonsillectomy Trial (CHAT): rationale, design, and challenges of a randomized controlled trial evaluating a standard surgical procedure in a pediatric population. Sleep 2011;34:1509–17. 23. Marcus CL, Moore RH, Rosen CL, et al. A randomized trial of adenotonsillectomy for childhood sleep apnea. N Engl J Med 2013;368:2366–76.

Scaling up Scientific Discovery—Dean et al.

24. Gallo LC, Penedo FJ, Carnethon M, et al. The Hispanic Community Health Study/Study of Latinos Sociocultural Ancillary Study: sample, design, and procedures. Ethn Dis 2014;24:77–83.
25. Redline S, Sotres-Alvarez D, Laredo J, et al. Sleep-disordered breathing in Hispanic/Latino individuals of diverse background. Am J Respir Crit Care Med 2014;189:335–44.
26. ClinicalTrials.gov. Nulliparous Pregnancy Outcomes Study: monitoring mothers-to-be (nuMoM2b). Accessed 5/16/2014. Available from: http://clinicaltrials.gov/ct2/show/NCT01322529.
27. ClinicalTrials.gov. Heart biomarker evaluation in apnea treatment (HeartBeat). Accessed 5/16/2014. Available from: http://clinicaltrials.gov/ct2/show/NCT01086800.
28. ClinicalTrials.gov. Sleep apnea intervention for cardiovascular disease reduction. Accessed 5/16/2014. Available from: http://clinicaltrials.gov/ct2/show/NCT01261390.
29. National Science Foundation. NSF leads federal efforts in big data. Accessed 5/16/2014. Available from: www.nsf.gov/news/news_summ.jsp?cntn_id=123607.
30. National Institutes of Health. 1000 Genomes Project data available on Amazon Cloud. Accessed 5/16/2014. Available from: www.nih.gov/news/health/mar2012/nhgri-29.htm.
31. National Heart, Lung, and Blood Institute. BioLINCC: Biologic Specimen and Data Repository Information Coordinating Center. Accessed 06/06/2014. Available from: https://biolincc.nhlbi.nih.gov/home/.
32. Mailman MD, Feolo M, Jin Y, et al. The NCBI dbGaP database of genotypes and phenotypes. Nat Genet 2007;39:1181–6.
33. Goldberger AL, Amaral LAN, Glass L, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 2000;101:e215–e220.
34. Moody GB, Mark RG, Goldberger AL. PhysioNet: a research resource for studies of complex physiologic and biomedical signals. Comput Cardiol 2000;27:179–82.
35. Redline S, Sanders MH, Lind BK, et al. Methods for obtaining and analyzing unattended polysomnography data for a multicenter study. Sleep Heart Health Research Group. Sleep 1998;21:759–67.
36. Zhang G-Q, Siegler T, Saxman P, et al. VISAGE: a query interface for clinical research. AMIA Jt Summits Transl Sci Proc 2010;2010:76–80.
37. Goldberger AL. Giles F. Filley lecture. Complex systems. Proc Am Thorac Soc 2006;3:467–71.
38. Goldberger AL, Amaral LA, Hausdorff JM, Ivanov PC, Peng C-K, Stanley HE. Fractal dynamics in physiology: alterations with disease and aging. Proc Natl Acad Sci U S A 2002;99:2466–72.
39. Kemp B, Varri A, Rosa AC, Nielsen KD, Gade J. A simple format for exchange of digitized polygraphic recordings. Electroencephalogr Clin Neurophysiol 1992;82:391–3.
40. Kemp B, Olivan J. European data format ‘plus’ (EDF+), an EDF like standard format for the exchange of physiological data. Clin Neurophysiol 2003;114:1755–61.
41. Kemp B, Roessen M. European data format now supports video. Sleep 2013;36:1111.
42. Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology. Heart rate variability: standards of measurement, physiological interpretation, and clinical use. Circulation 1996;93:1043–65.
43. Rusterholz T, Durr R, Achermann P. Inter-individual differences in the dynamics of sleep homeostasis. Sleep 2010;33:491–8.
44. Buckelmuller J, Landolt H-P, Stassen HH, Achermann P. Trait-like individual differences in the human sleep electroencephalogram. Neuroscience 2006;138:351–6.
45. Lipsitz LA, Hashimoto F, Lubowsky LP, et al. Heart rate and respiratory rhythm dynamics on ascent to high altitude. Br Heart J 1995;74:390–6.
46. Wolf A, Swift JB, Swinney HL, Vastano JA. Determining Lyapunov exponents from a time series. Physica D 1985;16:285–317.

SLEEP, Vol. 39, No. 5, 2016

47. Costa M, Goldberger AL, Peng C-K. Multiscale entropy analysis of biological signals. Phys Rev E Stat Nonlin Soft Matter Phys 2005;71:021906.
48. Peng CK, Buldyrev SV, Havlin S, Simons M, Stanley HE, Goldberger AL. Mosaic organization of DNA nucleotides. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics 1994;49:1685–9.
49. Thomas RJ, Mietus JE, Peng CK, Goldberger AL. An electrocardiogram-based technique to assess cardiopulmonary coupling during sleep. Sleep 2005;28:1151–61.
50. Ibrahim LH, Jacono FJ, Patel SR, et al. Heritability of abnormalities in cardiopulmonary coupling in sleep apnea: use of an electrocardiogram-based technique. Sleep 2010;33:643–6.
51. Mehra R, Stone KL, Blackwell T, et al. Prevalence and correlates of sleep-disordered breathing in older men: Osteoporotic Fractures in Men Sleep Study. J Am Geriatr Soc 2007;55:1356–64.
52. Giaretta D, all partners. D11.3 Report on a common vision of digital preservation: progress to year 3, 2013. Report No.: APARSENREP-D11_3-01-1_1.
53. Jee K, Kim G-H. Potentiality of big data in the medical sector: focus on how to reshape the healthcare system. Healthc Inform Res 2013;19:79–85.
54. Fan J, Han F, Liu H. Challenges of big data analysis. Natl Sci Rev 2014;1:293–314.
55. Mello MM, Francer JK, Wilenzick M, Teden P, Bierer BE, Barnes M. Preparing for responsible sharing of clinical trial data. N Engl J Med 2013;369.
56. Solomon AC, Hill R, Janssen E, Sanders SA, Heiman JR. Uniqueness and how it impacts privacy in health related social science datasets. ACM International Health Informatics Symposium (IHI 2012); January 28-30, 2012; Miami, Florida.
57. Sorlie PD, Bild DE, Lauer MS. Cardiovascular epidemiology in a changing world - Challenges to investigators and the National Heart, Lung, and Blood Institute. Am J Epidemiol 2012;175:597–601.

ACKNOWLEDGMENTS
The authors thank our Academic User Group (AUG) for providing direction and suggestions that have greatly improved the NSRR. Current AUG members include Dr. Jiawen Cai, Dr. Florian Chapotot, Dr. Nalaka Gooneratne, Dr. Daniel Gottlieb, Dr. Craig Johnson, Dr. Paul Peppard, Dr. Katie Stone, Dr. Simon Warby, and Dr. James Wilson. We would like to acknowledge our Steering Committee for providing day-to-day oversight. Our Steering Committee members include Dr. Ary L. Goldberger, Dr. Matthew Kim, Dr. Emily Kontos, Dan Mobley, Dr. Susan Redline, Dr. Remo Mueller, Susan Surovec, and Dr. Guo-Qiang Zhang. The National Sleep Research Resource would not exist without the efforts of our core team members, who include Michael Cailler, Kevin Gleason, Farhad Kafashi, Dr. Kenneth A. Loparo, Gang Shu, Tricia Tiu, Rui Wang, and Wei Wang.

SUBMISSION & CORRESPONDENCE INFORMATION
Submitted for publication January, 2015
Submitted in final revised form December, 2015
Accepted for publication January, 2016
Address correspondence to: Susan Redline, MD, MPH, 221 Longwood Ave, Room 225, Boston, MA; [email protected]

DISCLOSURE STATEMENT
This was not an industry-supported study. The work presented in this paper was funded by: NHLBI R24 HL114473, N01-HC-95169, NIH 1R01HL083075-01, R01HL098433, R01HL098433-02S1, 1U34HL105277-01, 1R01HL110068-01A1, 1R01HL113338-01, R21HL108226, P20NS076965, R01HL109493, R01GM104987-07, R01GM104987-07 and a research agreement with the Emma B. Bradley Hospital/Brown University supported by the Periodic Breathing Foundation. Dr. Redline has received research support from ResMed Foundation and has received the use of equipment from Philips Respironics and ResMed. The other authors have indicated no financial conflicts of interest. With the exception of Susan Surovec, all authors participated in study design and manuscript preparation.

