Proteomics 2014, 14, 2363–2368

DOI 10.1002/pmic.201470164


Meeting New Challenges: The 2014 HUPO-PSI/COSMOS Workshop 13–15 April 2014, Frankfurt, Germany

Sandra Orchard¹, Juan Pablo Albar², Pierre-Alain Binz³, Carsten Kettner⁴, Andrew R. Jones⁵, Reza M. Salek¹, Juan Antonio Vizcaino¹, Eric W. Deutsch⁶ and Henning Hermjakob¹

¹ European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
² ProteoRed, Centro Nacional de Biotecnología-CSIC, Madrid, Spain
³ Centre Hospitalier Universitaire Vaudois (CHUV), Lausanne, Switzerland
⁴ STRENDA, Beilstein-Institut, Frankfurt, Germany
⁵ Institute of Integrative Biology, University of Liverpool, Liverpool, UK
⁶ Institute for Systems Biology, Seattle, WA, USA

The Annual 2014 Spring Workshop of the Proteomics Standards Initiative (PSI) of the Human Proteome Organization (HUPO) was held this year jointly with the metabolomics COordination of Standards in MetabOlomicS (COSMOS) group. The range of existing MS standards (mzML, mzIdentML, mzQuantML, mzTab, TraML) was reviewed and updated in the light of new methodologies and advances in technologies. Adaptations to meet the needs of the metabolomics community were incorporated and a new data format for NMR, nmrML, was presented. The molecular interactions workgroup began work on a new version of the existing XML data interchange format. PSI-MI XML3.0 will enable the capture of more abstract data types such as protein complex topology derived from experimental data, allosteric binding, and dynamic interactions. Further information about the work of the HUPO-PSI can be found at http://www.psidev.info.

Eric Deutsch (Institute for Systems Biology, ISB, Seattle, WA, USA) welcomed delegates to the annual Spring Workshop of the HUPO-PSI, which this year was held jointly with COSMOS (COordination of Standards in MetabOlomicS) at Schloss Reinhartshausen, Eltville, Frankfurt, Germany. He briefly summarized the activities of the group over the last 12 months, including the publication of the MIAPE (minimum information about a proteomics experiment) Quant guidelines for reporting quantitative MS-based experiments in proteomics [1], the mzQuantML data standard for quantitative studies in proteomics [3], a new reference implementation of the PSICQUIC web service [2], and a number of new tools built on both new and existing data standards.

Martin Beck (EMBL, Heidelberg, Germany) talked about his group's work on protein complexes using chemical cross-linking [4]. The premise of the technique is to derive structural information by linking amino acid residues (for example, lysines) in the intact protein with a chemical cross-linker of known length, followed by enzymatic digestion and MS to identify regular and cross-linked peptides. Cross-linking MS can provide information on protein structure and folding, binding interfaces, and complex topology. The use of tools such as xQuest and xProphet for the identification and statistical validation of cross-linked peptides identified by MS was also discussed (http://proteomics.ethz.ch).

© 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

www.proteomics-journal.com


Reza Salek (EMBL-EBI, Hinxton, UK) gave an overview of the COSMOS initiative (http://www.cosmos-fp7.eu/), which is working to promote the adoption of standards and exchange formats for metabolomics data. This group is following on from the work of the Metabolomics Standards Initiative (MSI), which established minimum information standards and data exchange formats in 2007 but failed to achieve widespread adoption of these by the community, mainly due to a lack of applications. The recent establishment of resources such as the MetaboLights database at EMBL-EBI (http://www.ebi.ac.uk/metabolights/) and the Metabolomics Workbench by the NIH (http://www.metabolomicsworkbench.org/nihmetabolomics) has resulted in a need for data exchange and therefore data standardization. It is proposed to adopt mzML for MS data from LC–MS to GC–MS experiments and mzTab for capturing metabolite identification and quantification results. However, XML-based NMR data exchange standards are being developed ab initio by this group.

Next, Daniel Schober (IPB Halle, Germany) described nmrML, a new vendor-neutral open exchange and storage format for NMR data. A controlled vocabulary (CV) called nmrCV has been developed for the annotation of the data. Parsers are being written to allow interchange between proprietary vendor formats and nmrML. A first announcement of the availability of nmrML for public consultation was made in February 2014 (http://www.nmrML.org) and feedback is being collected. A specification document, an XSD (XML Schema Definition), CV, and example XML files are in preparation but need to be extended to cover additional use cases such as multidimensional and processed NMR data. Future plans in the group include the extension of the nmrML core and the initiation of work on tab-separated formats (nmrTab) for NMR-based metabolite identification and quantitation data.

The final speaker in the introductory session, Gil Omenn (University of Michigan, USA), spoke about the work of the Human Proteome Project (http://www.thehpp.org/) of the Human Proteome Organization (HUPO-HPP). The aim of this group is to make proteomics research a full counterpart to genomics in enhancing biomedical research and to complete, in a stepwise manner, a protein parts list including all isoforms, posttranslationally processed protein chains, amino acid polymorphisms, and PTMs in the human proteome. Two parallel approaches are being taken to produce this parts list: a chromosome-centric approach, in which the protein content of each chromosome is being separately investigated by a specific country or region (the C-HPP branch), and a biology/disease-centric approach, in which proteins are derived from specific biological samples (the B/D-HPP branch). Strict dataset guidelines have been enforced. These include use of the PSI data standards (when possible), and all original C-HPP datasets must be submitted to a resource of the ProteomeXchange consortium, such as PRIDE (for MS/MS data) or PASSEL/PeptideAtlas (for selected reaction monitoring (SRM) data) [5]. The PXD dataset identifier must appear in the last line of the abstract of any HPP paper appearing in the dedicated special issues of the Journal of Proteome Research (JPR). Efforts are currently ongoing to identify the “missing” proteins, which are predicted from annotation of the human proteome but whose existence has not yet been confirmed at the protein level. There are plans in place to perform proteomics experiments on unusual cell types and tissues, including those derived from the early stages of life, and under conditions of stress, and to undertake top-down proteomics of intact proteins to better identify splice and coding variants.
Special efforts will be made to identify members of protein families not generally amenable to MS proteomics, such as those with transmembrane helical structures, those lacking tryptic cleavage sites, and highly homologous proteins.

The meeting then separated into the distinct working groups.

There is now a comprehensive set of formats and standards produced by the MS group, including mzML, the format for mass spectrometer output; TraML, a format for SRM transition lists; a controlled vocabulary (CV) for the annotation of both; and the MIAPE suite of minimum requirements documents [6]. However, advances in technologies provide new challenges that need to be met by these standards, including the application of mzML to metabolomics, SWATH-MS and other data-independent acquisition workflows, and ion mobility MS. It was suggested that a new version of mzML (1.2) may be required to support enhanced data compression to better manage the extremely large file sizes generated by the current version (1.1). After extensive discussion, however, the meeting concluded that mzML is suitable for metabolomics as is, with only the addition of new CV terms required to meet the current needs of this community. A fuller feasibility study is already ongoing, but GC–MS data from Thermo and LECO instruments have been successfully mapped into mzML. Software has also been written and is being tested in ProteoWizard to encode and use ion mobility MS data in mzML. Existing best practices and CV terms appear to be sufficient and no new changes are needed.

Part of the meeting was devoted to reviewing other sections of the CV, including sections that contained a number of merged concepts. This resulted in the addition of new terms to meet recent demands on the existing standards, the splitting of the merged concepts, and the obsoleting of a number of terms that had previously been marked as inappropriate.

To tackle the issue of data compression, the use of mgzip, a well-defined extension of the gzip file format, for mzML files was discussed first. This would ensure that the file remains backwards compatible with existing gzip programs, formats, and proteomics APIs (application programming interfaces). The protocol provides only slightly less compression than gzip while achieving almost plain text read performance for random reads. No external index files are required, as the format uses gzip header blocks and embedded indexes.
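
The idea of block-wise compression that stays readable by ordinary gzip tools can be illustrated with a minimal Python sketch. It is not the mgzip layout itself: the block size, the `write_members` and `read_member` helpers, and the in-memory index (a stand-in for the embedded index described above) are all assumptions made for illustration. It relies only on the fact that concatenated gzip members form a valid gzip stream.

```python
import gzip
import io


def write_members(blocks):
    """Compress each block as an independent gzip member and concatenate them.

    Returns the concatenated bytes and a list of (offset, length) pairs,
    one per block. The index is kept in memory here purely for illustration.
    """
    out = io.BytesIO()
    index = []
    for block in blocks:
        member = gzip.compress(block)
        index.append((out.tell(), len(member)))
        out.write(member)
    return out.getvalue(), index


def read_member(data, index, i):
    """Decompress only the i-th block, without touching the others."""
    offset, length = index[i]
    return gzip.decompress(data[offset:offset + length])


# Hypothetical stand-ins for the spectrum elements of an mzML file.
spectra = [f'<spectrum index="{i}">peaks {i}</spectrum>\n'.encode() for i in range(1000)]
blocks = [b"".join(spectra[i:i + 100]) for i in range(0, len(spectra), 100)]

data, index = write_members(blocks)

# An ordinary gzip reader still sees one valid stream (backwards compatibility).
whole = gzip.decompress(data)

# Random access: fetch block 3 alone using its recorded offset.
block3 = read_member(data, index, 3)
```

Smaller blocks give faster random reads at the cost of a somewhat lower compression ratio, which matches the trade-off described above.
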
The combination of the new compression method MS-Numpress and mgzip yields mzML files that are often smaller than vendor files [7]. It is proposed that mzML 1.2 will recommend, rather than mandate, the use of MS-Numpress compression plus whole-file mgzip compression to store MS data more efficiently.

Issues remaining to be resolved by this group include deciding how synthesized MS2 spectra derived from MSE (a data-independent approach that acquires MS1 and MS2 mass spectra in an unbiased and parallel manner) and merged or clustered spectra should be captured in mzML. Current plans for the latter include expanding the existing CV terms to record the source spectra of each cluster, but a firm resolution was not reached.

The PEFF (PSI Extended FASTA Format) format was also discussed briefly. It will provide a common FASTA format for sequence databases that includes details of sequence variation (e.g. state of protein maturation, mutations, PTMs), as well as details of the database version and description. The format has been through the HUPO-PSI documentation procedure, which identified a few minor required updates; once these are addressed, it will be ready for resubmission and publication. A PEFF writer is already available at neXtProt, and a viewer and converter developed by the Compomics group are also available.

The proteomics informatics (PI) group produces standards and formats pertaining to the identification and quantitation of peptides present in MS samples, and their subsequent mapping to proteins. The group originally published MIAPE-MS informatics (MSI), containing both identification guidelines and some fairly limited quantification guidelines. As quantitative techniques have become more refined, the MIAPE-Quant guidelines have been developed to support the mzQuantML data standard for MS-based quantitative studies.

During the meeting, the encoding of SRM data in mzQuantML was reviewed, and some minor improvements in the usage of CV terms were identified. A new section of the mzQuantML specification document has been drafted. Models for absolute quantitation in mzQuantML were also discussed. Several semantic rules will be required to encode this, along with coordinated updates to the PSI validator.


Work was undertaken to prepare the updated MIAPE MSI 1.2 guidelines for submission to the PSI documentation process. The mzIdentML standard for peptide/protein identification (currently v. 1.1) also requires updating to meet the needs of new techniques. This includes further work on protein grouping to add formal rules on the representation of “protein groups”, where the identification of proteins may be ambiguous, for example due to shared peptides [8]. The inclusion of scoring data and statistical thresholds for protein groups was considered, as was support for peptide-level statistics. Improved support for the use of multiple search engines in mzIdentML was also discussed. The problems associated with chemical cross-linking studies were considered at great length. As an example, search engines may give overall scores for the identification, or individual scores for the alpha and beta chains. A new model for searches of cross-linked peptides was developed, using specific CV terms. The scoring of ambiguous modification positions within a single peptide was also included in the updated standard. All these refinements and new features will be included in mzIdentML 1.2, which is planned for final refinement and release in late 2014/early 2015. The design strategy for mzIdentML 1.2 is to maintain a backwards-compatible schema and accommodate updates via new CV terms and mapping rules.

The needs of the metabolomics community were also discussed by this group. Many of their needs can be met by mzTab [9], the lightweight supplement to the existing standard XML-based file formats (mzIdentML and mzQuantML). Two sections and a number of new fields will, however, be needed, and these will be part of a planned future 1.1 release of mzTab. mzQuantML was originally developed with some limited support for small molecule data, which needs to be tested thoroughly if the metabolomics community is to make more use of this format.
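
The ambiguity behind “protein groups” mentioned above can be made concrete with a small Python sketch. It is only an illustration and not the formal grouping rules being drafted for mzIdentML 1.2: the `group_proteins` function, the `subsumed` flag, and the accessions and peptide sequences are all hypothetical. Proteins supported by exactly the same peptides cannot be told apart and are placed in one group, and a group is flagged when its peptides are a strict subset of those of another group.

```python
from collections import defaultdict


def group_proteins(protein_to_peptides):
    """Group proteins that are supported by identical sets of peptides.

    Returns a list of groups sorted by member accession. Each group records
    its members, its peptides, and whether its peptide set is a strict
    subset of another group's peptide set ("subsumed").
    """
    by_peptide_set = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        by_peptide_set[frozenset(peptides)].append(protein)

    groups = []
    for peptide_set, members in by_peptide_set.items():
        # "<" on frozensets tests for a strict subset.
        subsumed = any(peptide_set < other for other in by_peptide_set)
        groups.append({
            "members": sorted(members),
            "peptides": sorted(peptide_set),
            "subsumed": subsumed,
        })
    return sorted(groups, key=lambda g: g["members"])


# Invented evidence: PROT_A and PROT_B share all of their peptides.
evidence = {
    "PROT_A": ["AAK", "CCR", "DDK"],
    "PROT_B": ["DDK", "CCR", "AAK"],
    "PROT_C": ["AAK"],
    "PROT_D": ["AAK", "EEK"],
}

groups = group_proteins(evidence)
for group in groups:
    print(group["members"], group["peptides"], "subsumed" if group["subsumed"] else "")
```

In this toy input the shared peptide AAK alone cannot distinguish any of the four proteins, which is the kind of case for which a standard needs explicit reporting rules.
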
The main work of the molecular interaction group was on the next version of the PSI-MI XML standard, to enable the capture of additional use cases of the data, as previously discussed at the 2013 HUPO-PSI Liverpool workshop. Many “housekeeping” elements were initially agreed on; these mainly address aspects of the current version of the XML (XML2.5.4) that do not work well in practice. The ability to exchange “abstracted” data, i.e. knowledge built from experimental data such as protein complex composition/topology and cooperative interactions (for example, allostery), was modeled into the new schema. The ability to add information on dynamic interactions, i.e. changes in response to time or changing concentration of agonist, was improved, as was the method for adding information on mutants and their subsequent effect on an interaction, which is now more systematic. The requirement to capture the causality of molecular interactions needs further discussion with external groups to ensure that data capture is adequate and in an appropriate format. The expertise in this area lies largely in the pharmaceutical industry, which will be approached via the EMBL-EBI Industry program. In order to advance the schema as rapidly as possible, it was agreed that example files containing representative data for each use case in PSI-MI XML2.5 would be collected, and the data then remapped into the prototype 3.0 schema. The schema, plus examples, would then be published via the PSI website and advertised for community feedback. At the end of a 60-day consultative period, the schema would be submitted to the PSI-MI document procedure and a publication prepared in parallel. No changes are planned to the more lightweight, tab-delimited format, MITAB.
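
For readers unfamiliar with the tab-delimited format, the following Python sketch shows how one MITAB 2.5 row could be read. It is a simplified illustration under stated assumptions: the short column names are labels chosen here for the 15 MITAB 2.5 columns, the `parse_mitab25_line` helper is hypothetical, the example row is invented (its accessions, PubMed ID, and interaction ID are placeholders), and the naive split on `|` ignores the possibility of that character appearing inside quoted values.

```python
# Labels chosen for this sketch for the 15 columns of a MITAB 2.5 row.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_methods", "first_authors", "publication_ids",
    "taxid_a", "taxid_b", "interaction_types", "source_databases",
    "interaction_ids", "confidence_values",
]


def parse_mitab25_line(line):
    """Split one MITAB 2.5 row into a dict of lists, one list per column.

    A lone "-" marks an empty column; multiple values are separated by "|".
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(MITAB25_COLUMNS):
        raise ValueError(f"expected {len(MITAB25_COLUMNS)} columns, got {len(fields)}")
    record = {}
    for name, raw in zip(MITAB25_COLUMNS, fields):
        record[name] = [] if raw == "-" else raw.split("|")
    return record


# Invented example row; all identifiers are placeholders.
example_row = "\t".join([
    "uniprotkb:P12345", "uniprotkb:Q67890", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Doe et al. (2010)", "pubmed:00000000",
    "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)', 'psi-mi:"MI:0469"(IntAct)',
    "intact:EBI-0000000", "-",
])

record = parse_mitab25_line(example_row)
print(record["id_a"], record["id_b"], record["detection_methods"])
```

Because each row is a flat list of columns, the format is easy to process with generic tools, whereas richer use cases such as complex topology or dynamic interactions need the nested XML representation.
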
As part of the discussion on changes to the data format, Sandra Orchard (EMBL-EBI) demonstrated the Complex Portal (www.ebi.ac.uk/intact/complex), an online resource that collates information on protein complexes, with search and download facilities already available. All data are held in the IntAct database, and the IntAct team is responsible for data maintenance and consistency, but the editorial tool is being made available to specialist teams to add macromolecular complexes relevant to their area of expertise.


The group then discussed the development of the JAMI Java application programming interfaces. JAMI is a single Java library designed to unify both the MITAB and PSI-XML standard formats by providing a common Java framework, while hiding the complexity of both from the naïve user. Once the first version of the JAMI core data model has been released, subsequent tool development will be made easier as tools need not be format specific. The urgent requirement for an enhanced tool suite, including non-Java options, was identified, and some groups volunteered to make in-house scripts used on the data publicly available. Tutorial material also needs to be developed as the new format and accompanying tool suite are published.

The meeting ended with a talk by Carsten Kettner (Beilstein-Institut, Germany), who spoke about the work of the STRENDA (Standards for Reporting Enzymology Data) Initiative. In order for comparative analyses to be undertaken, enzymologists rely on having a certain amount of basic information about the biochemical reactions they are studying, for example a detailed description of the methodology, the enzyme preparation, and the assay conditions. They also require information about measured kinetic parameters and catalytic mechanisms, inhibition and activation parameters, and other factors affecting the reaction (http://www.beilstein-institut.de/en/projects/strenda). STRENDA was founded in 2003 and has been supported by the Beilstein-Institut since then to define the minimum information for reporting functional enzyme data and to generate a comprehensive data acquisition and storage system (STRENDA DB). STRENDA DB aims to provide an assessment tool with which authors, journal editors, and reviewers can check whether the reporting of experimental data is STRENDA-compliant and thus matches the journals' instructions for authors [10]. The data entered in the assessment tool are stored in STRENDA DB until publication of the data, at which point the corresponding database entry will be made publicly accessible. It was recognized that, to gain broad acceptance by the community, it is important to avoid creating guidelines or tools that are so burdensome as to prohibit their widespread use. The major requirements for the design of the assessment tool were to reflect the structure of a journal manuscript and to support rapid and easy data entry and semiautomatic completion of mandatory fields. A validation procedure identifies input errors. The data can then be readily displayed and exported in a format appropriate for Supplementary materials. There are issues that version 1.0 does not tackle: for example, there is no cross-checking of the position of PTMs against the current version of the protein sequence in the sequence databases, and there are no plausibility checks of kinetic parameters. The use of CVs and ontologies could also be improved throughout, but a first release of the current implementation is planned for the second half of 2014.

Henning Hermjakob (EMBL-EBI) closed the meeting by thanking both attendees and speakers, and initiating the search for a venue for the 2015 meeting. The time and place of the 2015 workshop will be announced on the PSI website as soon as it has been decided.

The HUPO-PSI would like to acknowledge the enormous contribution made to this work by Juan-Pablo Albar (1953–2014), who enthusiastically supported our efforts for many years. His untimely death has deprived us of both a valued friend and colleague and one of our most passionate contributors.

The authors would like to acknowledge the contribution of the EU FP7 grants “ProteomeXchange” (grant number 260558) and COSMOS (grant number EC312941), and the BBSRC grant PROCESS (BB/K01997X/1).

References

[1] Martínez-Bartolomé, S., Deutsch, E. W., Binz, P. A., Jones, A. R. et al., Guidelines for reporting quantitative mass spectrometry based experiments in proteomics. J. Proteomics 2013, 95, 84–88.
[2] Del-Toro, N., Dumousseau, M., Orchard, S., Jimenez, R. C. et al., A new reference implementation of the PSICQUIC web service. Nucleic Acids Res. 2013, 41, W601–W606.
[3] Walzer, M., Qi, D., Mayer, G., Uszkoreit, J. et al., The mzQuantML data standard for mass spectrometry-based quantitative studies in proteomics. Mol. Cell. Proteomics 2013, 12, 2332–2340.
[4] Bui, K. H., von Appen, A., DiGuilio, A. L., Ori, A. et al., Integrated structural analysis of the human nuclear pore complex scaffold. Cell 2013, 155, 1233–1243.
[5] Vizcaíno, J. A., Deutsch, E. W., Wang, R., Csordas, A. et al., ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat. Biotechnol. 2014, 32, 223–226.
[6] Orchard, S., Data standardization and sharing-the work of the HUPO-PSI. Biochim. Biophys. Acta 2014, 1844, 82–87.
[7] Teleman, J., Dowsey, A. W., Gonzalez-Galarza, F. F., Perkins, S. et al., Numerical compression schemes for proteomics mass spectrometry data. Mol. Cell. Proteomics 2014, 13, 1537–1542.
[8] Seymour, S. L., Farrah, T., Binz, P. A., Chalkley, R. J. et al., A standardized framing for reporting protein identifications in mzIdentML 1.2. Proteomics 2014, 14, 2389–2399.
[9] Griss, J., Jones, A. R., Sachsenberg, T., Walzer, M. et al., The mzTab Data Exchange Format: communicating MS-based proteomics and metabolomics experimental results to a wider audience. Mol. Cell. Proteomics 2014, DOI: 10.1074/mcp.O113.036681.
[10] Apweiler, R., Armstrong, R., Bairoch, A., Cornish-Bowden, A. et al., A large-scale protein-function database. Nat. Chem. Biol. 2010, 6, 785.
