Europe PMC Funders Group Author Manuscript Nat Methods. Author manuscript; available in PMC 2017 August 01. Published in final edited form as: Nat Methods. 2017 August ; 14(8): 775–781. doi:10.1038/nmeth.4326.

Europe PMC Funders Author Manuscripts

The Image Data Resource: A Bioimage Data Integration and Publication Platform Eleanor Williams#1,2, Josh Moore#1, Simon W. Li#1, Gabriella Rustici1, Aleksandra Tarkowska1, Anatole Chessel3,6, Simone Leo1,5, Bálint Antal3, Richard K. Ferguson1, Ugis Sarkans2, Alvis Brazma2, Rafael E. Carazo Salas3,4, and Jason R. Swedlow1,7 1Centre

for Gene Regulation & Expression & Division of Computational Biology, University of Dundee, Dundee, Scotland, UK

2European

Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, United

Kingdom 3Pharmacology

& Genetics Departments and Cambridge Systems Biology Centre, University of Cambridge, Cambridge, UK

4School

of Cell and Molecular Medicine, University of Bristol, Bristol, UK

5Center

for Advanced Studies, Research, and Development in Sardinia (CRS4), Pula(CA), Italy

6LOB,

#

Ecole Polytechnique, CNRS, INSERM, Université Paris-Saclay, Palaiseau, France

These authors contributed equally to this work.

Europe PMC Funders Author Manuscripts

Abstract Access to primary research data is vital for the advancement of science. To extend the data types supported by community repositories, we built a prototype Image Data Resource (IDR) that collects and integrates imaging data acquired across many different imaging modalities. IDR links data from several imaging modalities, including high-content screening, super-resolution and timelapse microscopy, digital pathology, public genetic or chemical databases, and cell and tissue phenotypes expressed using controlled ontologies. Using this integration, IDR facilitates the 7

Correspondence to: [email protected], phone +44 1382 385819, fax +44 1382 388072, URL: www.openmicroscopy.org. Code availability All software for building and running the IDR and reading metadata of the IDR data sets is open source and available at https:// github.com/IDR and https://github.com/openmicroscopy. The custom scripts used to combine metadata into a single CSV files are available at https://github.com/IDR/idr-metadata. Data availability All data sets described in this paper are available at https://idr.openmicroscopy.org. Author Contributions J.R.S., A.B. and R.E.C.S conceived and funded the project which was overseen by J.R.S. J.M., S.W.L, A.T., S.L. and E.W. built the IDR software stack: J.M. designed the software architecture and managed the software development team; S.W.L. built all the tools for deploying IDR in the OpenStack cloud and ran all the IDR systems; A.T. built Mapr, the metadata querying application and updated the IDR web stack; S.L. updated Bio-Formats to read the incoming datasets, and along with S.W.L, developed and ran the feature calculation software; E.W. performed all data curation and annotation. G.R., S.W.L. and E.W. sourced and received the datasets. A.C. analyzed features of the integrated datasets. B.A. helped with the IDR/Mineotaur integration. R.K.F. designed the updates to the OMERO UI. U.S. helped with the integration of IDR datasets into BioStudies.

Competing Financial Interests J. R. S is affiliated with Glencoe Software, Inc., an open-source US-based commercial company that provides commercial licenses for OME software.

Williams et al.

Page 2

Europe PMC Funders Author Manuscripts

analysis of gene networks and reveals functional interactions that are inaccessible to individual studies. To enable re-analysis, we also established a computational resource based on Jupyter notebooks that allows remote access to the entire IDR. IDR is also an open source platform that others can use to publish their own image data. Thus IDR provides both a novel on-line resource and a software infrastructure that promotes and extends publication and re-analysis of scientific image data. Much of the published research in the life sciences is based on image datasets that sample 3D space, time, and the spectral characteristics of detected signal to provide quantitative measures of cell, tissue and organismal processes and structures. The sheer size of biological image datasets makes data submission, handling and publication extremely challenging. An image-based genome-wide “high-content” screen (HCS) may contain over a million images, and new “virtual slide” and “light sheet” tissue imaging technologies generate individual images that contain gigapixels of data showing tissues or whole organisms at subcellular resolutions. At the same time, published versions of image data often are mere illustrations: they are presented in processed, compressed formats that cannot convey the measurements and multiple dimensions contained in the original image data and cannot easily be reanalyzed. Furthermore, conventional publications neither include the metadata that define imaging protocols, biological systems and perturbations nor the processing and analytic outputs that convert the image data into quantitative measurements.

Europe PMC Funders Author Manuscripts

Several public image databases have appeared over the last few years. These provide online access to image data, enable browsing and visualization, and in some cases include experimental metadata. The Allen Brain Atlas, the Human Protein Atlas, and the Edinburgh Mouse Atlas all synthesize measurements of gene expression, protein localization and/or other analytic metadata with coordinate systems that place biomolecular localization and concentration into a spatial and biological context1–3. Similarly, many other examples of dedicated databases for specific imaging projects exist, each tailored to its aims and its target community4–8. There are also a number of public resources that serve as true scientific, structured repositories for image data, i.e., that collect, store and provide persistent identifiers for long-term access to submitted datasets, as well as provide rich functionalities for browsing, search and query. One archetype is the EMDataBank, the definitive community repository for molecular reconstructions recorded by electron microscopy9. The Journal of Cell Biology has built the JCB DataViewer (http://jcb-dataviewer.rupress.org/), which publishes image datasets associated with its on-line publications. The CELL Image Library publishes several thousand community-submitted images, some of which are linked to publications10. FigShare stores 2D pictures derived from image datasets, and can provide links for download of image datasets (http://figshare.com). The EMDataBank recently has released a prototype repository for 3D tomograms, the EMPIAR resource11. Finally, the BioStudies and Dryad archives include support for browsing and downloading image data files linked to studies or publications12 (https://datadryad.org/). Some of these provide a resource for a specific imaging domain (e.g., EMDataBank) or experiment (Mitocheck), while others archive datasets and provide links to a related publication available at an external journal’s website (BioStudies). However, no existing resource links several independent biological imaging datasets to provide an “added value” platform, like the

Nat Methods. Author manuscript; available in PMC 2017 August 01.

Williams et al.

Page 3

Expression Atlas achieves for a broad set of gene expression datasets13 and UniProt delivers for protein sequence and function datasets14.

Europe PMC Funders Author Manuscripts

Inspired by these added value resources, we have built a next-generation Image Data Resource (IDR) – an added value platform that combines data from multiple independent imaging experiments and from many different imaging modalities, integrates them into a single resource, and makes the data available for re-analysis in a convenient, scalable form. IDR provides, for the first time, a prototyped resource that supports browsing, search, visualization and computational processing within and across datasets acquired from a wide variety of imaging domains. For each study, metadata related to the experimental design and execution, the acquisition of the image data, and downstream interpretation and analysis are stored in IDR alongside the image data and made available for search and query through a web interface and a single API. Wherever possible, we have mapped the phenotypes determined by dataset authors to a common ontology. For several studies, we have calculated comprehensive sets of image features that can be used by others for re-analysis and the development of phenotypic classifiers. By harmonizing the data from multiple imaging studies into a single system, IDR users can query across studies and identify phenotypic links between different experiments and perturbations.

Results Current IDR

Europe PMC Funders Author Manuscripts

IDR is currently populated with 24 imaging studies comprising 35 screens or imaging experiments from the biological imaging community, most of which are related to and linked to published works (Table 1). IDR holds ~42 TB of image data in ~36M image planes and ~1M individual experiments, and includes all associated experimental (e.g., genes, RNAi, chemistry, geographic location), analytic (e.g., submitter-calculated image regions and features), and functional annotations. Datasets from studies in human, mouse, fly, plant and fungal cells are included. Examples of imaging modalities and experimental approaches so far supported include super resolution 3DSIM and dSTORM; high content chemical and siRNA screening, whole slide histopathology imaging, and live imaging of human and fungal cells and intact mice. Finally, imaging data from Tara Oceans, a global survey of plankton and other marine organisms, is also included. The current collection of datasets samples a variety of biomedically-relevant biological processes like cell shape, division and adhesion, at scales ranging from nanometer-scale localization of proteins in cells to millimeter-scale structure of tissues from animals (Table 2). Genetic, Chemical and Functional Annotation in IDR To enable querying across the different datasets stored in IDR, we have included annotations describing experimental perturbations (genetic mutants, siRNA targets and reagents, expressed proteins, cell lines, drugs, etc.) and phenotypes declared by the study authors either from quantitative analysis or visual inspection of the image data. Wherever possible, experimental metadata in IDR link to external resources that are the authoritative resource for those metadata (Ensembl, NCBI, PubChem, etc.).

Nat Methods. Author manuscript; available in PMC 2017 August 01.

Williams et al.

Page 4

Europe PMC Funders Author Manuscripts

The result is that IDR provides a sampling of phenotypes related to experimental perturbations across several independent studies. Many of the studies in IDR perturb gene function by mutation or siRNA depletion. To calculate the sampling of gene orthologues, we used Ensembl's BioMart resource15 to access a normalized list of gene orthologues. Overall, 19,601 gene orthologues are sampled, and 84.1% of gene orthologues are sampled more than 20 times. 90.3% of gene orthologues are sampled in three or more studies, so even in this early incarnation the phenotypes of perturbations in the majority of known genes are sampled in several different assays and organisms.

Europe PMC Funders Author Manuscripts

We have normalized the phenotypes included in studies submitted to IDR. Functional annotations (e.g., “increased peripheral actin") have been converted to defined terms in the Cellular Microscopy Phenotype Ontology (CMPO)16, or other ontologies in collaboration with the data submitters. Overall, 88% of the functional annotations have links to defined, published controlled vocabularies. 158 different ontology-normalized phenotypes (e.g., "increased number of actin filaments", "mitosis arrested") are included in IDR, and 136 are reported by authors in only one study. Nonetheless, these phenotypes are well-sampled--the mean number of samples per phenotype, across HCS and other imaging datasets is 698 and the median is 144. This skewing occurs because some phenotypes are very common or are over-represented in specific assays, e.g. “protein localized in cytosol phenotype”, (CMPO_0000393; https://idr.openmicroscopy.org/mapr/phenotype/? value=CMPO_0000393). Nonetheless, there are several cases where phenotypes are observed in multiple orthogonal assays. Two examples are the “round cell” phenotype (CMPO_0000118; https://idr.openmicroscopy.org/mapr/phenotype/? value=CMPO_0000118) and the "increased nuclear size" phenotype (CMPO_0000140; https://idr.openmicroscopy.org/mapr/phenotype/?value=CMPO_0000140). Figure 1 summarizes the sampling of phenotypes across the current IDR datasets. Several classes of phenotypes are included, and many cases are sampled in thousands of experiments. In total, IDR includes >1M individual experiments (Table 1), with ~9 % annotated with experimentally observed phenotypes. Data Visualization in IDR IDR integrates image data and metadata from several studies into a single resource. The current IDR web user interface (WUI) is based on the open source OMERO.web application17 supplemented with a plugin allowing datasets to be viewed by ‘Study’, ‘Genes’, ‘Phenotypes’, ‘siRNAs’, ‘Antibodies’, ‘Compounds’, and ‘Organisms’ (see Supplementary Note). Using this architecture makes the integrated data resource available for access and re-use in several ways (see Supplementary Note). Image data are viewable as thumbnails for each study and multi-dimensional images can be viewed and browsed. Tiled whole slide images used in histopathology are also supported. Where identified regions of interest (ROIs) have been submitted with the image data, these have been included and linked, and where possible, made available through the IDR WUI. IDR images, thumbnails and metadata are accessible through the IDR WUI and web-based API in JSON format (see Supplementary Note). They also can be embedded into other pages using the OMERO.web gateway (e.g., https://www.eurobioimaging-interim.eu/image-data-repository.html).

Nat Methods. Author manuscript; available in PMC 2017 August 01.

Williams et al.

Page 5

Standardized Interfaces for Imaging Metadata

Europe PMC Funders Author Manuscripts

IDR integrates imaging data from many different, independent studies. These data were acquired using several different imaging modalities, in the absence of any over-arching standards for experimental, imaging or analytic metadata. While efforts like MIACA (http:// miaca.sourceforge.net/), NeuroVault18, MULTIMOT19 and several other projects have proposed data standards in specific imaging subdomains, there is not yet a metadata standard that crosses all of the imaging domains potentially served by IDR. We therefore sought to adopt lightweight methods from other communities that have had broad acceptance20 and converted metadata submitted in custom formats – spreadsheets, PDFs, MySQL databases, and Microsoft Word documents -- into a consistent tabular format inspired by the MAGETAB and ISA-TAB specifications21, 22 that could then be used for importing semistructured metadata like gene and ontology identifiers into OMERO23. We also used the Bio-Formats software library to identify and convert well-defined, semantically-typed elements that describe the imaging metadata (e.g., image pixel size) as specified in the OME Data Model24, 25. The resulting translation scripts were used to integrate datasets from multiple distinct studies and imaging modalities into a single resource. The scripts are publicly available (see Online Methods) and thus comprise a framework for recognizing and reading a range of metadata types across several imaging domains into a common, open specification. Added Value of IDR

Europe PMC Funders Author Manuscripts

Because IDR links gene names and phenotypes, query results that combine genes and phenotypes across multiple studies are possible through simple text-based search. Searching for the gene SGOL1 (https://idr.openmicroscopy.org/mapr/gene/?value=SGOL1) returns a range of phenotypes from four separate studies associated with mitotic defects (for example, CMPO_0000118, CMPO_0000305, CMPO_0000212, CMPO_0000344, etc.)4, 26 but also an accelerated secretion phenotype (CMPO_0000246) in a screen for defects in protein secretion27. A second example is provided in a histopathology study of tissue phenotypes in a series of mouse mutants. Knockout of carbonic anhydrase 4 (Car4; https:// idr.openmicroscopy.org/mapr/gene/?value=Car4) in mouse results in a range of defects in homeostasis in the brain, rib growth and male fertility28–30. Data held in IDR show abnormal nuclear phenotypes in several tissues from Car4-/- mice (e.g., GI: https:// idr.openmicroscopy.org/webclient/?show=dataset-153; liver: https://idr.openmicroscopy.org/ webclient/?show=image-1918940; male reproductive tract: https://idr.openmicroscopy.org/ webclient/?show=image-1918953). The human orthologue, CA4, is involved in certain forms of retinitis pigmentosa31, 32. Data presented in IDR from the Mitocheck study show that siRNA depletion of CA4 in HeLa cells4 also results in abnormally shaped nuclei (https://idr.openmicroscopy.org/webclient/?show=well-828419) consistent with a defect in some aspect of the cell division cycle. Phenotypes across distinct studies can also be used to build novel representations of gene networks. Figure 2A shows the gene network created when the gene knockouts or knockdowns that caused an elongated cell phenotype (CMPO_0000077) in studies in S. pombe and human cells are linked by queries to String DB33 and visualized in Cytoscape34 (see Supplementary Note and Supplementary Table 1). The genes discovered in the three

Nat Methods. Author manuscript; available in PMC 2017 August 01.

Williams et al.

Page 6

Europe PMC Funders Author Manuscripts

studies form interconnected, non-overlapping, complementary networks that connect specific macromolecular complexes to the elongated cell phenotype. For example, HELZ2, MED30, MED18 and MED20 are all part of the Mediator Complex, but were identified as “elongated cell” hits in separate studies using different biological models (idr0001-A, idr0008-B, idr0012-A, Figure 2B). POLR2G (from idr0012-A), PAF1 (from idr0001-A) and SUPT16H (from idr0008-B) were scored as elongated cell hits in these studies and are all part of the Elongation complex in the RNA Polymerase II transcription pathway. Finally, ASH2L (“elongated cell phenotype” in idr0012-A), associates with SETD1A and SETD1B (“elongated cell phenotype” in idr0001-A) to form the Set1 histone methyltransferase (HMT). These examples demonstrate that these individual hits are probably not due to offtarget effects or characteristics of individual biological models but arise through conserved, specific functions of large macromolecular complexes. This shows the utility and importance of combining phenotypic data of studies from different organisms and scales, and of integrating metadata from independent studies, to generate added value that can enhance the understanding of biological mechanisms and lead to new mechanistic hypothesis and predictions.

Europe PMC Funders Author Manuscripts

The integration of experimental, image and analytic metadata also provides an opportunity to include new functionalities for more advanced visualization and analytics of imaging data and metadata, bringing further added value to the original studies and datasets. We have added the data analytics tool Mineotaur35 to one of IDR’s datasets (https:// idr.openmicroscopy.org/mineotaur/). This allows visual querying and analysis of quantitative feature data. For instance, having shown that components of the Set1 HMT function in controlling cell morphology in S. pombe and human cells, we noticed that genes like ASH2L were in the “elongated cell” network based on human cell data (idr0012-A) but not S. pombe data, where ash2, the S. pombe ASH2L orthologue, was not annotated as a cell elongation “hit”. We first noted that ash2 has a microtubule cytoskeleton phenotype (https:// idr.openmicroscopy.org/webclient/?show=well-592371). We then queried the criteria previously used for cell shape hits in the Sysgro screen (idr0001-A) and found that ash2 fell just below the cutoff originally used in this study to define phenotypic hits for cell shape (Supplementary Note). When combined with results on ASH2L from HeLa cells (Figure 2B) these results suggest that the Set1 HMT has a strongly conserved role in controlling cell shape and the cytoskeleton in unicellular and multicellular organisms. Data Integration and Access Like most modern on-line resources IDR makes data available through a web user-interface as well as a web-based JSON API. This encourages third-parties to make use of IDR in their own sites. For example, image data in IDR has been linked to study data in BioStudies, thereby extending the linkage of study and image metadata (e.g., https://www.ebi.ac.uk/ biostudies/studies/S-EPMC4704494), and to PhenoImageShare36, an on-line phenotypic repository (e.g., http://www.phenoimageshare.org/search/?term=&hostName=Image+Data +Repository+(IDR)). These are examples of use of IDR as a service that delivers data for other applications to integrate and reuse.

Nat Methods. Author manuscript; available in PMC 2017 August 01.

Williams et al.

Page 7

Europe PMC Funders Author Manuscripts

To add further value and extend the possibilities for reuse of IDR data, we are calculating comprehensive sets of feature vectors of IDR image data using the open source tool WNDCHARM37. To date full WND-CHARM features have been calculated for images in idr0002-A, idr0005-A, idr0008-B, idr0009-A, idr0009-B, idr0012-A, and parts of idr0013-A and idr0013-B. Feature calculations for other IDR datasets are in progress. Features are stored in IDR using OMERO’s HDF5-based data store and available through the OMERO API (see Supplementary Note). The integration of image-based phenotypes and calculated features makes IDR an attractive candidate for computational re-analysis. To ease the access to IDR’s TB-scale datasets, we have connected IDR to a Jupyter notebook-based computational resource (https:// idr.openmicroscopy.org/jupyter) that exposes IDR datasets via an API (https:// idr.openmicroscopy.org/about/api.html). We include exemplar notebooks that provide visualization of image features using PCA, access to images annotated with CMPO phenotypes, calculation of gene networks, calculation of WND-CHARM features for individual images and recreation of Figures 1 and 2 from IDR data. Alternatively, users can run their own analyses using notebooks stored in GitHub (https://github.com/IDR/idrnotebooks). To allow re-use of IDR metadata locally, we have made all IDR databases, metadata and thumbnails available for download and have built Ansible scripts that automate the deployment of the IDR software stack (original image data are not included; see Supplementary Note).

Discussion

Europe PMC Funders Author Manuscripts

Making data public and available is a critical part of the scientific enterprise38 (https:// wellcome.ac.uk/what-we-do/our-work/expert-advisory-group-data-access) (https:// royalsociety.org/topics-policy/projects/science-public-enterprise/report/). To take the next step in facilitating the reuse and meta-analysis of image datasets we have built IDR, a nextgeneration data technology that integrates and publishes image data and metadata from a wide range of imaging modalities and scales in a consistent format. IDR integrates experimental, imaging, phenotypic and analytic metadata from several independent studies into a single resource, allowing new modes of biological Big Data querying and analysis. As more datasets are added to IDR, they will potentiate and catalyze the generation of new biological hypotheses and discoveries. In IDR, we have linked image metadata from several independent studies. Experimental, imaging phenotypic and analytic metadata are recorded in a consistent format. Rather than declaring and attempting to enforce a strict imaging data standard, IDR provides tools for supporting community formats and releases these as a framework that facilitates data reuse. We hope that the availability of this framework will provide incentives for others to structure their metadata in shareable formats that can be read into IDR or other applications, whether based on OMERO or not. In the future, we can imagine that these and other capabilities could be extended in IDR - or similar repositories that link to IDR - to enable systematic integration, visualization and analytics across imaging studies, thereby helping to harness and capitalize on the exponentially increasing amounts of bio-imaging data that the community generates.

Nat Methods. Author manuscript; available in PMC 2017 August 01.

Williams et al.

Page 8

Europe PMC Funders Author Manuscripts

As of this writing, IDR has published 35 reference image datasets grouped into 24 studies (Table 1) and, utilizing EMBL-EBI’s Embassy Cloud, has capacity to receive and publish many more. Authors of scientific publications that are already published or under submission can submit accompanying image datasets for publication in IDR, using the metadata specifications and formats we have built. Once published, the datasets can be browsed and viewed through IDR’s WUI, or queried and re-analyzed using the IDR computational resource. Details about the submission process are available (https:// idr.openmicroscopy.org/about/submission.html). IDR software and technology is open source, so it can be accessed and built into other image data publication systems. This supports the building of technology and installations that integrate and publish bio-image data for the scientific community, allowing discoveries and predictions similar to what we have shown in Figure 2. IDR therefore functions both as a resource for image data publication and as a technology platform that supports the creation of on-line scientific image databases and services. In the future, those databases and services may ultimately amalgamate to form resources analogous to the genomic resources that are the foundation of much of modern biology.

Online Methods Architecture and Population of IDR

Europe PMC Funders Author Manuscripts

IDR (https://idr.openmicroscopy.org) was built using open-source OMERO17 and BioFormats24 as a foundation. Deployments are managed by Ansible playbooks along with reusable roles on an OpenStack-based cloud contained within the EMBL-EBI Embassy resource. Datasets (Table 1) were collected by shipped USB-drive or transferred by Aspera. Included datasets were selected according to the criteria defined by the Euro-BioImaging/ Elixir Data Strategy concept of "reference images" (http://www.eurobioimaging.eu/contentnews/euro-bioimaging-elixir-image-data-strategy), which states that image datasets for publication should be related to published studies, linked as much as possible to other resources and be candidates for re-use, re-analysis, and/or integration with other studies. Experimental and analytic metadata were submitted in either spreadsheets (CSV, XLS), PDF or HDF5 files or a MySQL database, each using its own custom format. We converted these custom formats to a consistent tabular format inspired by the MAGE and ISA-TAB specifications21, 22 and combined into a single CSV file using a custom script (available in https://github.com/IDR/idr-metadata) and imported into OMERO. Imaging metadata and binary data were imported into OMERO using Bio-Formats. Experimental and analytic metadata were stored using OMERO.tables, an HDF5-backed tabular data store used by OMERO. For each dataset, metadata that were valuable for querying and search were copied to OMERO’s key-value-based Map Annotation facility23. This means that different metadata types and elements can be accessed using different parts of the OMERO API, depending on the search and querying capabilities they require. For more information on the construction of queries, see Supplementary Note.

Nat Methods. Author manuscript; available in PMC 2017 August 01.

Williams et al.

Page 9

Supplementary Material Refer to Web version on PubMed Central for supplementary material.

Acknowledgements Europe PMC Funders Author Manuscripts

We thank all the study authors who submitted datasets for inclusion in the IDR for their contributions and help in incorporating their datasets. We also thank the systems support team at EMBL-EBI, in particular Richard Boyce, David Ocana, Charles Short, and Andy Cafferkey for their support of the project’s use of the Embassy Cloud. We are particularly grateful to Simon Jupp for help with adding and defining new ontology terms. The IDR project was funded by the BBSRC (BB/M018423/1) and Horizon 2020 Framework Programme of the European Union under grant agreement No. 688945 (Euro-BioImaging Prep Phase II). Updates to OMERO and Bio-Formats were supported by the Wellcome Trust (095931/Z/11/Z) and Horizon 2020 Framework Programme of the European Union under grant agreement No. 634107 (MULTIMOT). RECS was funded by a BBSRC Responsive Mode grant (BB/K006320/1), a European Research Council Starting Researcher Investigator Grant (SYSGRO) and the University of Bristol.

References

Europe PMC Funders Author Manuscripts

1. Uhlen M, et al. Proteomics. Tissue-based map of the human proteome. Science. 2015; 347:1260419. [PubMed: 25613900] 2. Hawrylycz MJ, et al. An anatomically comprehensive atlas of the adult human brain transcriptome. Nature. 2012; 489:391–399. [PubMed: 22996553] 3. Armit C, et al. eMouseAtlas, EMAGE, and the spatial dimension of the transcriptome. Mammalian genome : official journal of the International Mammalian Genome Society. 2012; 23:514–524. [PubMed: 22847374] 4. Neumann B, et al. Phenotypic profiling of the human genome by time-lapse microscopy reveals cell division genes. Nature. 2010; 464:721–727. [PubMed: 20360735] 5. Graml V, et al. A genomic Multiprocess survey of machineries that control and link cell shape, microtubule organization, and cell-cycle progression. Dev Cell. 2014; 31:227–239. [PubMed: 25373780] 6. Koh JL, et al. CYCLoPs: A Comprehensive Database Constructed from Automated Analysis of Protein Abundance and Subcellular Localization Patterns in Saccharomyces cerevisiae. G3 (Bethesda). 2015; 5:1223–1232. [PubMed: 26048563] 7. Gonczy P, et al. Functional genomic analysis of cell division in C. elegans using RNAi of genes on chromosome III. Nature. 2000; 408:331–336. [PubMed: 11099034] 8. Fowlkes CC, et al. A quantitative spatiotemporal atlas of gene expression in the Drosophila blastoderm. Cell. 2008; 133:364–374. [PubMed: 18423206] 9. Lawson CL, et al. EMDataBank.org: unified data resource for CryoEM. Nucleic Acids Res. 2011; 39:D456–464. [PubMed: 20935055] 10. Orloff DN, Iwasa JH, Martone ME, Ellisman MH, Kane CM. The cell: an image library-CCDB: a curated repository of microscopy data. Nucleic Acids Res. 2013; 41:D1241–1250. [PubMed: 23203874] 11. Iudin A, Korir PK, Salavert-Torres J, Kleywegt GJ, Patwardhan A. EMPIAR: a public archive for raw electron microscopy image data. Nat Methods. 2016; 13:387–388. [PubMed: 27067018] 12. McEntyre J, Sarkans U, Brazma A. The BioStudies database. Molecular systems biology. 2015; 11:847. [PubMed: 26700850] 13. Petryszak R, et al. Expression Atlas update--an integrated database of gene and protein expression in humans, animals and plants. Nucleic Acids Res. 2016; 44:D746–752. [PubMed: 26481351] 14. UniProt C. UniProt: a hub for protein information. Nucleic Acids Res. 2015; 43:D204–212. [PubMed: 25348405] 15. Yates A, et al. Ensembl 2016. Nucleic Acids Res. 2016; 44:D710–716. [PubMed: 26687719] 16. Jupp S, et al. The cellular microscopy phenotype ontology. Journal of biomedical semantics. 2016; 7:28. [PubMed: 27195102]

Nat Methods. Author manuscript; available in PMC 2017 August 01.

Williams et al.

Page 10

Europe PMC Funders Author Manuscripts Europe PMC Funders Author Manuscripts

17. Allan C, et al. OMERO: flexible, model-driven data management for experimental biology. Nat Methods. 2012; 9:245–253. [PubMed: 22373911] 18. Gorgolewski KJ, et al. NeuroVault.org: a web-based repository for collecting and sharing unthresholded statistical maps of the human brain. Frontiers in neuroinformatics. 2015; 9:8. [PubMed: 25914639] 19. Masuzzo P, et al. An open data ecosystem for cell migration research. Trends Cell Biol. 2015; 25:55–58. [PubMed: 25484346] 20. Brazma A, Krestyaninova M, Sarkans U. Standards for systems biology. Nat Rev Genet. 2006; 7:593–605. [PubMed: 16847461] 21. Sansone SA, et al. Toward interoperable bioscience data. Nat Genet. 2012; 44:121–126. [PubMed: 22281772] 22. Rayner TF, et al. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics. 2006; 7:489. [PubMed: 17087822] 23. Li S, et al. Metadata management for high content screening in OMERO. Methods. 2015 24. Linkert M, et al. Metadata matters: access to image data in the real world. J Cell Biol. 2010; 189:777–782. [PubMed: 20513764] 25. Goldberg IG, et al. The Open Microscopy Environment (OME) Data Model and XML File: Open Tools for Informatics and Quantitative Analysis in Biological Imaging. Genome Biol. 2005; 6:R47. [PubMed: 15892875] 26. Heriche JK, et al. Integration of biological data by kernels on graph nodes allows prediction of new genes involved in mitotic chromosome condensation. Mol Biol Cell. 2014; 25:2522–2536. [PubMed: 24943848] 27. Simpson JC, et al. Genome-wide RNAi screening identifies human proteins with a regulatory function in the early secretory pathway. Nat Cell Biol. 2012; 14:764–774. [PubMed: 22660414] 28. Shah GN, et al. Carbonic anhydrase IV and XIV knockout mice: roles of the respective carbonic anhydrases in buffering the extracellular space in brain. Proc Natl Acad Sci U S A. 2005; 102:16771–16776. [PubMed: 16260723] 29. Scheibe RJ, et al. Carbonic anhydrases IV and IX: subcellular localization and functional role in mouse skeletal muscle. Am J Physiol Cell Physiol. 2008; 294:C402–412. [PubMed: 18003750] 30. Wandernoth PM, et al. Role of carbonic anhydrase IV in the bicarbonate-mediated activation of murine and human sperm. PLoS One. 2010; 5:e15061. [PubMed: 21124840] 31. Rebello G, et al. Apoptosis-inducing signal sequence mutation in carbonic anhydrase IV identified in patients with the RP17 form of retinitis pigmentosa. Proc Natl Acad Sci U S A. 2004; 101:6617–6622. [PubMed: 15090652] 32. Yang Z, et al. Mutant carbonic anhydrase 4 impairs pH regulation and causes retinal photoreceptor degeneration. Hum Mol Genet. 2005; 14:255–265. [PubMed: 15563508] 33. Szklarczyk D, et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 2015; 43:D447–452. [PubMed: 25352553] 34. Cline MS, et al. Integration of biological networks and gene expression data using Cytoscape. Nat Protoc. 2007; 2:2366–2382. [PubMed: 17947979] 35. Antal B, Chessel A, Carazo Salas RE. Mineotaur: a tool for high-content microscopy screen sharing and visual analytics. Genome Biol. 2015; 16:283. [PubMed: 26679168] 36. Adebayo S, et al. PhenoImageShare: an image annotation and query infrastructure. Journal of biomedical semantics. 2016; 7:35. [PubMed: 27267125] 37. Orlov N, et al. WND-CHARM: Multi-purpose image classification using compound image transforms. Pattern Recognition Letters. 2008; 29:1684–1693. [PubMed: 18958301] 38. Boulton G, Rawlins M, Vallance P, Walport M. Science as a public enterprise: the case for open data. Lancet. 2011; 377:1633–1635. [PubMed: 21571134] 39. Fuchs F, et al. Clustering phenotype populations by genome-wide RNAi and multiparametric imaging. Molecular systems biology. 2010; 6:370. [PubMed: 20531400] 40. Rohn JL, et al. Comparative RNAi screening identifies a conserved core metazoan actinome by phenotype. J Cell Biol. 2011; 194:789–805. [PubMed: 21893601]

Nat Methods. Author manuscript; available in PMC 2017 August 01.

Williams et al.

Page 11

Europe PMC Funders Author Manuscripts Europe PMC Funders Author Manuscripts

41. Breker M, Gymrek M, Schuldiner M. A novel single-cell screening platform reveals proteome plasticity during yeast stress responses. J Cell Biol. 2013; 200:839–850. [PubMed: 23509072] 42. Thorpe PH, Alvaro D, Lisby M, Rothstein R. Bringing Rad52 foci into focus. J Cell Biol. 2011; 194:665–667. [PubMed: 21893595] 43. Toret CP, D'Ambrosio MV, Vale RD, Simon MA, Nelson WJ. A genome-wide screen identifies conserved protein hubs required for cadherin-mediated cell-cell adhesion. J Cell Biol. 2014; 204:265–279. [PubMed: 24446484] 44. Fong KW, et al. Whole-genome screening identifies proteins localized to distinct nuclear bodies. J Cell Biol. 2013; 203:149–164. [PubMed: 24127217] 45. Srikumar T, et al. Global analysis of SUMO chain function reveals multiple roles in chromatin regulation. J Cell Biol. 2013; 201:145–163. [PubMed: 23547032] 46. Doil C, et al. RNF168 binds and amplifies ubiquitin conjugates on damaged chromosomes to allow accumulation of repair proteins. Cell. 2009; 136:435–446. [PubMed: 19203579] 47. Karsenti E, et al. A holistic approach to marine eco-systems biology. PLoS Biol. 2011; 9:e1001177. [PubMed: 22028628] 48. Wawer MJ, et al. Toward performance-diverse small-molecule libraries for cell-based phenotypic screening using multiplexed high-dimensional profiling. Proc Natl Acad Sci U S A. 2014; 111:10911–10916. [PubMed: 25024206] 49. Breinig M, Klein FA, Huber W, Boutros M. A chemical-genetic interaction map of small molecules using high-throughput imaging in cancer cells. Molecular systems biology. 2015; 11:846. [PubMed: 26700849] 50. Sero JE, et al. Cell shape and the microenvironment regulate nuclear translocation of NF-kappaB in breast epithelial and tumor cells. Molecular systems biology. 2015; 11:790. [PubMed: 26148352] 51. Barr AR, Bakal C. A sensitised RNAi screen reveals a ch-TOG genetic interaction network required for spindle assembly. Sci Rep. 2015; 5:10564. [PubMed: 26037491] 52. Lawo S, Hasegan M, Gupta GD, Pelletier L. Subdiffraction imaging of centrosomes reveals higherorder organizational features of pericentriolar material. Nat Cell Biol. 2012; 14:1148–1158. [PubMed: 23086237] 53. Szymborska A, et al. Nuclear pore scaffold structure analyzed by super-resolution microscopy and particle averaging. Science. 2013; 341:655–658. [PubMed: 23845946] 54. Dickerson D, et al. High resolution imaging reveals heterogeneity in chromatin states between cells that is not inherited through cell division. BMC Cell Biol. 2016; 17:33. [PubMed: 27609610] 55. Pascual-Vargas P, et al. RNAi screens for Rho GTPase regulators of cell shape and YAP/TAZ localisation in triple negative breast cancer. Sci Data. 2017; 4:170018. [PubMed: 28248929] 56. Yang W, et al. Regulation of Meristem Morphogenesis by Cell Wall Synthases in Arabidopsis. Curr Biol. 2016; 26:1404–1415. [PubMed: 27212401]

Nat Methods. Author manuscript; available in PMC 2017 August 01.

Williams et al.

Page 12

Editor’s Summary This Resource describes the Image Data Resource (IDR), a prototype on-line system for biological image data that links experimental and analytic data across multiple datasets and greatly promotes image data sharing and re-analysis.

Europe PMC Funders Author Manuscripts Europe PMC Funders Author Manuscripts Nat Methods. Author manuscript; available in PMC 2017 August 01.

Williams et al.

Page 13

Europe PMC Funders Author Manuscripts Figure 1. Sampling of Phenotypes in the IDR.

Europe PMC Funders Author Manuscripts

The numbers of samples per phenotype. Each sample represents a well from a micro-well plate in a screen or image from a dataset. Wells annotated as controls were not included. User submitted phenotype terms were mapped to the CMPO terms shown here. Colors represent higher-level groupings of phenotype terms. Point size shows the number of studies (group of related screens) each phenotype is linked to points of increasing size representing 1, 2, 3 or 4 studies respectively.

Nat Methods. Author manuscript; available in PMC 2017 August 01.

Williams et al.

Page 14

Europe PMC Funders Author Manuscripts Europe PMC Funders Author Manuscripts

Figure 2. Network Analysis of Genes Linked to the Elongated Cell Phenotype in the IDR.

A. Protein-protein interaction network based on the genes linked to the elongated cell phenotype (CMPO_0000077) in three distinct IDR studies. Genes from S. pombe (green, idr0001-A5), HeLa cell morphology (blue, idr0012-A39) and HeLa Actinome (red, idr0008B40) are displayed with linkages (gray) from StringDB33. To enable comparisons in Cytoscape, the human orthologues of S. pombe genes are used for the genes identified in idr0001-A. For more information, see Supplementary Note.

Nat Methods. Author manuscript; available in PMC 2017 August 01.

Williams et al.

Page 15

B. Zoomed view of network in A, with gene names. See Supplementary Note for the list of gene names used in the figure.

Europe PMC Funders Author Manuscripts Europe PMC Funders Author Manuscripts Nat Methods. Author manuscript; available in PMC 2017 August 01.

Europe PMC Funders Author Manuscripts Type

1 1

gene deletion screen RNAi screen protein localization screen protein localization screen RNAi screen

RNAi screen gene deletion screen

small molecule screen histopatholo gy of gene knockouts

S. cerevisiae D. melanogaster Human

S. cerevisiae D. melanogaster, Human Human Human

S. cerevisiae Human Human multi-species Human Human

Mus musculus Human Human Human Human

idr0004-thorpe-rad52

idr0005-toret-adhesion

idr0006-fong-nuclearbodies

idr0007-srikumar-sumo

idr0008-rohn-actinome

idr0009-simpson-secretion

idr0010-doil-dnadamage

idr0011-ledesmafernandez-dad4

idr0012-fuchs-cellmorph

idr0013-neumann-mitocheck

idr0015-UNKNOWN-taraoceans

idr0016-wawer-bioactivecompoundprofiling

idr0017-breinig-drugscreen

idr0018-neff-histopathology

idr0019-sero-nfkappab

Nat Methods. Author manuscript; available in PMC 2017 August 01.

idr0020-barr-chtog

idr0021-lawo-pericentriolarmaterial

idr0023-szymborska-nuclearpore

1 1 1

protein localization using 3D-SIM protein localization using dSTORM

1

1

1

2

1

5

1

2

2

1

1

2

1

1

RNAi screen

HCS image analysis

small molecule screen

geographic screen

RNAi screen

RNAi screen

RNAi screen

protein screen

S. cerevisiae

idr0003-breker-plasticity

1

RNAi screen

1

Human

gene deletion screen

idr0002-heriche-condensation

Species

S. pombe

idr0001-graml-sysgro

Study

No. of screens or experiments

524

414

36,960

25,872

899

147,456

869,820

32,776

200,995

45,692

8,957

56,832

397, 056

55,944

3,456

240,848

45, 792

3,765

97,920

1,152

109,728

5D Images

0.0005

0.0003

0.03

0.03

0.27

2.48

3.19

2.49

14.54

0.38

0.4

0.08

3.25

0.12

0.02

1.40

0.14

0.17

0.20

2.10

10.06

Size (TB)

1

1

2

0

48

0

2

0

18

18

1

2

3

46

23

8

1

1

14

2

19

Phenotypes

7

9

241

198

9

1,281

29,542

84

18,393

16,701

5209

18,675

17,960

12,826

377

12,743

13,035

4,195

6,234

102

3,005

Targets

359

414

1,232

2,156

248

36,864

144,000

84

206,592

26,112

8736

56,832

397,056

26,496

1,152

16,224

15,264

4,512

32,640

1,152

18,432

Experiments

53

52

51

50

---

49

48

47

4

39

Under review

46

27

40

45

44

43

42

41

26

5

Reference

The phenotype column contains the number of submitted phenotypes. The number of genes, compounds or proteins identified as targets for analysis is listed in the Targets column and the ‘Experiments’ column lists the number of individual wells in HCS studies or imaging experiments in non-screen datasets.

Table 1

Europe PMC Funders Author Manuscripts

List of Datasets in IDR

Williams et al. Page 16

in situ hybridization

Human

A. thaliana

idr0028-pascualvargas-rhogtpases

idr0032-yang-meristem

Average

Sum

RNAi screen

S. cerevisiae

3D-tracking of tagged chromatin loci

Type

idr0027-dickerson-chromatin

Species

Europe PMC Funders Author Manuscripts

Study

35

1

4

1

105,782

2,538,777

458

155,332

229

5D Images

1.73

42

0.003

0.18

0.03

Size (TB)

9

224

5

9

0

Phenotypes

6,713

161,119

115

170

8

Targets

41,764

1,002,328

115

5544

112

Experiments

Europe PMC Funders Author Manuscripts No. of screens or experiments

56

55

54

Reference

Williams et al. Page 17

Nat Methods. Author manuscript; available in PMC 2017 August 01.

Europe PMC Funders Author Manuscripts https://idr.openmicroscopy.org/webclient/?show=well-469267 https://idr.openmicroscopy.org/webclient/?show=well-547609 https://idr.openmicroscopy.org/webclient/?show=image-820684 https://idr.openmicroscopy.org/webclient/?show=well-37472 https://idr.openmicroscopy.org/webclient/?show=well-45407 https://idr.openmicroscopy.org/webclient/?show=image-648950 https://idr.openmicroscopy.org/webclient/?show=image-3063667 https://idr.openmicroscopy.org/webclient/?show=image-2849866 https://idr.openmicroscopy.org/webclient/?show=image-1821818 https://idr.openmicroscopy.org/webclient/?show=image-1636543 https://idr.openmicroscopy.org/webclient/?show=well-1056578 https://idr.openmicroscopy.org/webclient/?show=well-1029401 https://idr.openmicroscopy.org/webclient/?show=well-1046336 https://idr.openmicroscopy.org/webclient/?show=dataset-369 https://idr.openmicroscopy.org/webclient/?show=well-1024671 https://idr.openmicroscopy.org/webclient/?show=well-1030579 https://idr.openmicroscopy.org/webclient/?show=dataset-51 https://idr.openmicroscopy.org/webclient/?show=dataset-61 https://idr.openmicroscopy.org/webclient/?show=image-2858266 https://idr.openmicroscopy.org/webclient/?show=image-2895051 https://idr.openmicroscopy.org/webclient/?show=image-3125776

idr0005-toret-adhesion

idr0006-fong-nuclearbodies

idr0007-srikumar-sumo

idr0008-rohn-actinome

idr0009-simpson-secretion

idr0010-doil-dnadamage

idr0011-ledesmafernandez-dad4

idr0012-fuchs-cellmorph

idr0013-neumann-mitocheck

idr0015-UNKNOWN-taraoceans

idr0016-wawer-bioactivecompoundprofiling

idr0017-breinig-drugscreen

idr0018-neff-histopathology

idr0019-sero-nfkappab

idr0020-barr-chtog

idr0021-lawo-pericentriolarmaterial

idr0023-szymborska-nuclearpore

idr0027-dickerson-chromatin

idr0028-pascualvargas-rhogtpases

idr0032-yang-meristem

https://idr.openmicroscopy.org/webclient/?show=well-4852

idr0003-breker-plasticity

idr0004-thorpe-rad52

https://idr.openmicroscopy.org/webclient/?show=well-119093

idr0002-heriche-condensation

IDR URL https://idr.openmicroscopy.org/webclient/?show=well-590686

idr0001-graml-sysgro

Study

Exemplar URLs for each of the studies currently deposited in the IDR.

Table 2

Europe PMC Funders Author Manuscripts

List of IDR Datasets URLs

Williams et al. Page 18

Nat Methods. Author manuscript; available in PMC 2017 August 01.

The Image Data Resource: A Bioimage Data Integration and Publication Platform.

Access to primary research data is vital for the advancement of science. To extend the data types supported by community repositories, we built a prot...
NAN Sizes 1 Downloads 4 Views