Computational proteomics: Integrating mass spectral data into a biological context.

Journal of Proteomics 129 (2015) 1–2

Editorial

A little over 20 years ago, an algorithm that correlates experimental mass spectra with those theoretically generated from a sequence database was published in a journal devoted to mass spectrometry, and it catalyzed a revolution [1]. In essence, SEQUEST allowed us to stand on the shoulders of genomics scientists and leverage the information generated by their projects for the emerging field of proteomics. As proteomics continued to evolve at a rapid pace, it became evident that computer science was playing an increasingly important role in revolutionizing biochemical and biomedical research. Nowadays the cornerstone of proteomics analysis is the simultaneous identification and quantitation of the protein components of biological systems using increasingly sophisticated and sensitive mass analyzers. Paradoxically, the increased capabilities of mass spectrometry have led to many proteomic groups functioning as a ‘black box’ with regard to the handling and meaningful processing of the huge amount of data generated in high-throughput mass spectrometry-based proteomics experiments. Although a complete set of sequence-specific and satellite fragment ions represents the Rosetta Stone for unambiguously reconstructing the structure of the parent ion after MS/MS analysis, assessing each individual product ion spectrum recorded in high-throughput proteomics measurements is impractical, even for experienced mass spectrometrists. Decoding the information contained in the fragmentation spectra generated in data-independent large-scale proteomic experiments requires data reconstruction through probabilistic inference. A major challenge of this approach is ranking the collection of compatible peptide matches by assigning statistical confidence scores that quantify the agreement between observed and predicted data.
Ultimately, such algorithms help shed light on the biology of complex samples such as organelles, cell lysates, biological fluids, and tissues [2]. As the field moved on, computational proteomics came to offer a complete arsenal of tools for peptide/protein identification, quantification, and quality assessment [3], thus enabling cutting-edge analytical methods such as selected reaction monitoring (SRM) [4], as well as labeled and label-free quantitation. Today it is inconceivable to think of proteomics without considering specialized pattern recognition and statistical learning algorithms, which have evolved to accomplish far more complex tasks than those of its cousin field, genomics. This is because proteomic experiments comprise multiple ways of acquiring data, ranging from bottom-up, to middle-down, to top-down, and even data-independent analysis [5]. Moreover, proteomics embraces challenges that include exploring the vast universe of post-translational modifications, pinpointing subtle changes in the abundance of biomolecules, and even helping to define protein structures and protein interactions.

This article is part of a Special Issue entitled: Computational Proteomics.
http://dx.doi.org/10.1016/j.jprot.2015.10.013 1874-3919/© 2015 Published by Elsevier B.V.

In more recent years, great efforts have also been directed toward the adoption of best practices [6], the standardization of software outputs, and especially the deposition of proteomics data to make them more accessible and reusable [7]. From a hardware perspective, impressive new mass spectrometry technologies have been deployed on a yearly basis, with improvements in scan rates, sensitivity, and accuracy. Mass spectrometry has become accessible to more groups, thus giving more researchers access to data. Yet unlocking the full potential of these cutting-edge technologies is only possible for groups with cross-disciplinary expertise, i.e., specialists in biology, chemistry, and computer science. Indeed, the collaboration between experts in these fields marked the birth of Computational Proteomics. Groups with such a blend are better positioned to make groundbreaking discoveries and have thus been at the forefront of the field. In contrast, groups with expertise only in data generation are bound to follow in the footsteps of others because they lack the autonomy to blaze their own path [8].

Here we present JOP's first issue fully devoted to computational proteomics; it consolidates the application of existing procedures and exposes new and promising alternatives for solving open challenges by bringing to the fore contributions from leading groups. This focus issue opens with a review by Dr. Veit Schwammle et al., from Dr. Peter Roepstorff's group, providing a bird's-eye view of the various computational methods for the high-throughput analysis of protein PTMs. We then have Dr. Tao Xu et al. and Dr. Yaoyang Zhang et al., both originating from Dr.
John Yates' lab, pushing the limits of the peptide spectrum matching workflow by presenting a peptide spectrum matching search engine, termed ProLuCID, and a statistical filter for pinpointing confident protein identifications. Dr. Hao Chi et al., from Dr. Si-Min He's group, demonstrate a powerful algorithm, p-Find-Alioth, for unrestricted database searching in high-resolution MS/MS data, which enables the identification of unanticipated PTMs and mutations. Juliana Fischer et al. and Diogo Borges et al., both from Dr. Paulo Carvalho's group, debut algorithms for statistically scoring phosphopeptide sites and for identifying cross-linked peptides, respectively. We note that the latter, made available through a tool termed SIM-XL, is the first to adopt standards that allow data deposition in the PRIDE public repository. Also regarding standards, we have Gerhard Mayer, from Dr. Martin Eisenacher's group, introducing ProCon (PROteomics CONversion tool), which converts the output of several search engines to standard formats such as mzIdentML and PRIDE XML. Dr. Oliver Horlacher et al. make available mzJava, an open-source library to aid bioinformaticians in designing applications that process, cluster, align, annotate, and score MS/MS spectra originating from glycans or peptides. Still on glycans, we have Dr. Alejandro E. Brito et al., from Dr. Marshall Bern's group, disclosing Cartoonist, a very user-friendly software tool to aid researchers in evaluating potential glycan biomarkers. Dr. Yehia Farag et al., together with Dr. Harald Barsnes, make available a web-based tool that helps us visualize proteomic results and extract important information. Dr. Sarah R. Langley and Dr. Manuel Mayr provide an in-depth comparison of statistical methodologies for detecting proteins with differential abundance in label-free proteomics data. Dr. Yassene Mohammed and Dr. Christoph Borchers disclose an extensive library of surrogate peptides for all human proteins. Dr. Lars Malmström et al. address a very hot topic, that of data-independent analysis (DIA), and demonstrate how to combine genomics and quantitative DIA MS proteomics data. In addition, to spice things up even more in the realm of DIA, we have Dr. Guo Shou Teo and Dr. Hyungwon Choi et al. launching mapDIA, a tool that performs rigorous data preprocessing in the context of multi-sample DIA data analysis to determine differential expression status. Last, but not least, there is Dr. Hyungwon Choi, Sinae Kim, Damian Fermin et al., together with Dr. Alexey I. Nesvizhskii, making available QProt, the latest version of their algorithm for pinpointing proteins with differential abundance according to protein-level intensity data in DDA label-free quantitative proteomics. In summary, we proudly make available a special issue that brings together contributions from leading groups in the field of computational proteomics originating from Brazil, Canada, China, Denmark, England, Germany, the Netherlands, Norway, Singapore, Sweden, Switzerland, and the United States.

References

[1] J.K. Eng, A.L. McCormack, J.R. Yates, An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database, J. Am. Soc. Mass Spectrom. 5 (1994) 976–989, http://dx.doi.org/10.1016/1044-0305(94)80016-2.
[2] M.P. Washburn, D. Wolters, J.R. Yates III, Large-scale analysis of the yeast proteome by multidimensional protein identification technology, Nat. Biotechnol. 19 (2001) 242–247, http://dx.doi.org/10.1038/85686.
[3] A.I. Nesvizhskii, A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics, J. Proteomics 73 (2010) 2092–2123, http://dx.doi.org/10.1016/j.jprot.2010.08.009.
[4] P. Picotti, R. Aebersold, Selected reaction monitoring-based proteomics: workflows, potential, pitfalls and future directions, Nat. Methods 9 (2012) 555–566, http://dx.doi.org/10.1038/nmeth.2015.
[5] J.D. Venable, M.-Q. Dong, J. Wohlschlegel, A. Dillin, J.R. Yates, Automated approach for quantitative analysis of complex peptide mixtures from tandem mass spectra, Nat. Methods 1 (2004) 39–45, http://dx.doi.org/10.1038/nmeth705.
[6] F. da Veiga Leprevost, V.C. Barbosa, E.L. Francisco, Y. Perez-Riverol, P.C. Carvalho, On best practices in the development of bioinformatics software, Front. Genet. 5 (2014), http://dx.doi.org/10.3389/fgene.2014.00199.
[7] Y. Perez-Riverol, E. Alpi, R. Wang, H. Hermjakob, J.A. Vizcaíno, Making proteomics data accessible and reusable: current state of proteomics databases and repositories, Proteomics 15 (2015) 930–949, http://dx.doi.org/10.1002/pmic.201400302.
[8] N. Bandeira, A. Nesvizhskii, M. McIntosh, Advancing next-generation proteomics through computational research, J. Proteome Res. 10 (2011) 2895, http://dx.doi.org/10.1021/pr200484b.

Paulo C. Carvalho
Computational Mass Spectrometry Group, Carlos Chagas Institute, Fiocruz, Paraná, Brazil
Corresponding authors. E-mail address: [email protected].

Gabriel Padron
Proteomics Department, Center for Genetic Engineering and Biotechnology, Havana, Cuba

Juan J. Calvete
Instituto de Biomedicina de Valencia, Consejo Superior de Investigaciones Cientificas, 46010 Valencia, Spain

Yasset Perez-Riverol
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Corresponding authors. E-mail address: [email protected].
