Anal Bioanal Chem (2015) 407:2659–2663 DOI 10.1007/s00216-014-8438-8

FEATURE ARTICLE

“All proteins all the time”—a comment on visions, claims, and wording in mass spectrometry-based proteomics

Wolf D. Lehmann

Received: 6 November 2014 / Revised: 16 December 2014 / Accepted: 18 December 2014 / Published online: 25 February 2015
© Springer-Verlag Berlin Heidelberg 2015

W. D. Lehmann (*)
Core Facility Molecular Structure Analysis, German Cancer Research Centre (DKFZ), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
e-mail: [email protected]

In comprehensive protein analysis by mass spectrometry-based proteomics (MSP), MS1 and MS2 data on peptides, protein sequences, and probabilistic algorithms are combined to create protein identification reports. Natural protein sequences comprise only a tiny fraction of all possible sequences that could be created by random combination of the proteinogenic amino acids. This restriction is in accordance with the evolution of all forms of life from a common origin, and it facilitates the probabilistic annotation of a peptide sequence on the basis of incomplete analytical information, which is the basic concept of MSP.

The current standard technology in MSP is bottom-up proteomics (also named shotgun proteomics), i.e. the combination of protein digestion and analysis of the proteolytic peptides by LC–MS/MS. Sequence information from MS2 spectra of peptides is notoriously incomplete, so that complete sequencing from MS2 data alone (de-novo peptide sequencing) is not possible for most MS2 spectra acquired in a bottom-up experiment. MSP attempts to overcome this lack of information by using protein sequence databases as supplementary information. Search engines combine the fragmentary MS2 sequence information with database sequence information to annotate a complete peptide sequence to a peptide MS2 spectrum. The probability that the annotation is correct is expressed as a score. Complete genomic sequencing of bacteria and higher organisms, together with instrumental progress in mass spectrometry, led to the development of MSP.
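To make the database-supported annotation step concrete, the following deliberately naive Python sketch scores candidate tryptic peptides from a toy protein "database" by counting how many observed fragment m/z values fall close to their theoretical b- and y-ions. All sequences, spectra, and names are invented for illustration; real search engines use far more elaborate scoring models.

```python
# Toy illustration of database-supported spectrum annotation (not any vendor's
# algorithm): rank candidate tryptic peptides from a tiny protein "database"
# by how many observed fragment m/z values match their theoretical b/y ions.

PROTON, WATER = 1.00728, 18.01056
RESIDUE = {  # monoisotopic residue masses in Da
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276, 'V': 99.06841,
    'T': 101.04768, 'C': 103.00919, 'L': 113.08406, 'I': 113.08406, 'N': 114.04293,
    'D': 115.02694, 'Q': 128.05858, 'K': 128.09496, 'E': 129.04259, 'M': 131.04049,
    'H': 137.05891, 'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}

def tryptic_peptides(protein):
    """Cleave after K/R (no missed cleavages; the proline rule is ignored for brevity)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in 'KR':
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def fragment_mz(peptide):
    """Singly charged b- and y-ion m/z values of a peptide."""
    masses = [RESIDUE[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(peptide))]
    return b + y

def match_score(observed_mz, peptide, tol=0.02):
    """Crude score: number of observed fragments within tol of a theoretical ion."""
    theo = fragment_mz(peptide)
    return sum(any(abs(o - t) <= tol for t in theo) for o in observed_mz)

# hypothetical mini-database and an invented, incomplete list of fragment m/z values
database = {'PROT1': 'MKPEPTIDERSAMPLEK', 'PROT2': 'MASSSPECTRAK'}
observed = [175.119, 304.162, 419.188, 532.273, 387.201]
candidates = [p for prot in database.values() for p in tryptic_peptides(prot)]
ranking = sorted(candidates, key=lambda p: match_score(observed, p), reverse=True)
print(ranking[:3])  # best-scoring peptide annotations for this toy spectrum
```

Even in this toy version, only a few fragment ions suffice to single out one database peptide; the database supplies the rest of the sequence, which is exactly the probabilistic shortcut discussed in the text.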


The probabilistic aspect of MSP markedly expands the number of identified and/or quantified proteins obtained from LC–MS1–MS2 data sets compared with the use of the analytical data alone. However, this advantage is achieved at the expense of the result list containing several false entries. This error rate is usually named the false discovery rate (FDR). A reliable FDR estimation requires a large dataset and an appropriate estimate of the true proteome size. Identifications with high scores contain a low fraction of false results, whereas identifications with scores near the cut-off have a much higher percentage of errors. Normally, the average FDR is given.

In detail, probability-based analysis in bottom-up MSP has two steps. Step 1 is the annotation of a peptide sequence to an MS2 spectrum with the help of a database-supported probabilistic algorithm. Step 2 is inferring the probable source protein of an annotated peptide sequence. The guideline in this step is that the smallest set of proteins by which the occurrence of the annotated peptides can be explained is regarded as "identified".

The development of MSP gained momentum when the draft sequence of the human genome was completed in approximately 2001. The impressive success of MSP since then has had a great effect on the whole field of mass spectrometry in the life sciences, on the style of results discussion, on the ranking of research topics, and on the general style of scientific publications. It seems that omics-type analyses are on the way to being regarded as the best-practice method for the whole field of mass spectrometry. In this article, a critical view of this development is given in the form of nine theses. The thesis format was selected to achieve brevity and clarity.
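Step 2, the parsimony rule of accepting the smallest protein set that explains all annotated peptides, can be approximated by a greedy set-cover procedure. The sketch below is an assumed, simplified illustration of that idea rather than the algorithm of any particular search engine; the peptide-to-protein mapping is invented.

```python
# Illustrative greedy approximation of the "smallest set of proteins that
# explains all annotated peptides" (step 2 above). Simplified, hypothetical logic.

def infer_proteins(peptide_to_proteins):
    """Greedy set cover: repeatedly pick the protein that explains the most
    still-unexplained peptides until every peptide is accounted for."""
    # invert the mapping: protein -> set of peptides it could have produced
    protein_to_peptides = {}
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides.setdefault(protein, set()).add(peptide)

    unexplained = set(peptide_to_proteins)
    inferred = []
    while unexplained:
        best = max(protein_to_peptides,
                   key=lambda p: len(protein_to_peptides[p] & unexplained))
        covered = protein_to_peptides[best] & unexplained
        if not covered:   # remaining peptides map to no protein in the database
            break
        inferred.append(best)
        unexplained -= covered
    return inferred

# hypothetical annotated peptides and the database proteins they map to
example = {
    'PEPTIDEK': {'P1'},
    'SAMPLER':  {'P1', 'P2'},
    'SHAREDK':  {'P1', 'P2', 'P3'},
    'UNIQUER':  {'P3'},
}
print(infer_proteins(example))   # -> ['P1', 'P3']: two proteins explain all four peptides
```

Note that P2 could also have produced two of the peptides; parsimony simply discards it, which is convenient but not necessarily biologically correct.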



Defining proteomics as a progression of genomics is disputable

Since its beginning, proteomics has been promoted as a logical progression of genomics. Consequently, a major objective of proteomics was and still is detection of the "complete proteome", in analogy to complete sequencing of a genome. However, fundamental genome–proteome differences mean there is a high scientific risk in defining complete proteome analysis as a major objective of proteomics in the same way as complete genome analysis is defined as the central objective of genomics. Essential genome–proteome differences are listed in Table 1.

Numerous MSP publications claim an almost complete coverage of the human proteome or of the proteome of other organisms. What is probably meant here is coverage of the respective encoded proteome, i.e. the proteins corresponding to the protein-coding regions of the genome. Because of numerous chemical and enzymatic co- and post-translational covalent modifications, the true size of the human proteome exceeds that of the encoded human proteome by an unknown factor. The true proteome is currently unknown in size, highly dynamic, and cell-type specific; its size cannot be derived from the number of genes.

A direct consequence of promoting a conceptual genomics–proteomics analogy is that the protein count in MSP is taken as a measure of scientific excellence. In fact it is rather a measure of instrument and workflow performance. By accepting the protein count as the main objective, scientists practice a voluntary self-incapacitation, because their specific expertise is no longer essential.

Proteomics is an extension of non-probabilistic protein analysis and not vice versa

The probabilistic interpretation of LC–MS1–MS2 spectra of peptides has led to a great increase in analytical productivity. However, all MSP reports contain a fraction of false results that are indistinguishable from the correct ones. MSP therefore differs markedly from standard MS technology, because it has a much higher error rate. Attempts to redefine standard MS analyses as a subgroup of omics-type analyses are particularly problematic. A typical example of such a redefinition is the use of the term "targeted proteomics" to describe analysis of a single protein or a family of structurally or functionally related proteins. This term is misleading in two ways: it describes a method without the omics-typical comprehensiveness, and it does not use the omics-typical probabilistic data evaluation. Another example of a disputable terminology is to name proteins with natural isotopic composition "SILAC light labeled", which ascribes the property of being labeled to a group of definitely unlabeled molecules.

Error reduction in MSP requires improved bioinformatics, not manual curation

The idea that proteomics results might be improved by manual control or "curation" is misleading. First, this concept implies that an extra criterion exists that is ignored by the search algorithm but available to the controlling expert. This situation occurs only rarely, and general manual control and/or curation is simply impracticable given the typical amount of MSP data. However, the scoring of results can be improved by incorporating so far unused knowledge into the final scoring algorithm. Implementation of such improvements in search engines requires expertise both in MS methods and in bioinformatics.

Table 1 Fundamental differences between the genome and proteome

Property            | Genome                                 | Proteome
Size                | Exactly defined                        | Unknown
Building blocks     | 4 + some modified units                | 20 + numerous modified units
Abundance range     | 1 or several gene copies               | 9 orders of magnitude between protein abundances
Repair system       | Highly effective                       | No repair; processing and variable turnover
Biological dynamics | Mainly stable                          | Moderately to highly dynamic
Localization        | Nucleus                                | Everywhere
Safety shell        | Nucleus, histones                      | None
Synthesis           | Linked to mitosis, tightly controlled  | Both continuous and stimulated gene expression

Errors in a database created by MSP (e.g. false phospho-sites) cannot be removed

Probabilistic annotation of modifications is associated with a substantial error rate, because the allowance of modifications results in a large expansion of the search space. Several protein-phosphorylation-site databases have been created by MSP experiments focused on the phosphoproteome. The designation "curated database" cannot change the inherent error rate caused by the probabilistic method of creation. The absence of a modification is almost impossible to prove by MS methods. The consequence is that, in practice, a false entry in a modification database generated by an MSP procedure cannot be removed. Use of a database created by an MSP method as a reference database in a subsequent MSP experiment will lead to a further increase of the error rate.
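As a rough illustration of how modification searches inflate the search space (an assumed, simplified count using a hypothetical peptide, not a description of any specific search engine): for a peptide containing $n$ serine, threonine, or tyrosine residues, allowing up to $k$ phosphate groups multiplies the number of candidate peptidoforms of that one sequence by

$$\sum_{i=0}^{k} \binom{n}{i},$$

so a peptide with $n = 6$ candidate sites and up to $k = 3$ phosphates already corresponds to $1 + 6 + 15 + 20 = 42$ forms that must be scored and distinguished, all sharing the same amino acid sequence and differing only in the number and position of the modifications.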

Mass spectrometry-based proteomics leads to a cult of unprocessed data

Unprocessed data ("original data", "raw data") are per se "correct", whereas result lists generated by search engines necessarily contain a portion of false entries, which cannot be distinguished from the correct ones. Unprocessed data therefore enjoy a special reputation and value, irrespective of the context in which they were generated. This data cult leads to massive efforts regarding the long-term storage of original data, e.g. for their possible reanalysis at a later time by a different algorithm. The impression is that storage efforts and actual benefits are highly divergent. Bioinformatics experts may profit from access to original datasets for refinement of their algorithms, but most mass spectrometry experts do not profit from original-data repositories. There is the implicit idea that raw or original data might contain a wealth of information that is currently non-extractable because of software imperfections or for other reasons. Although this idea is supported by intuition, it is also highly speculative; it is equally probable that current search engines already extract the essential information in MSP datasets almost completely.


Streamlined wording in MSP creates an "all problems solved" impression

The typical MSP manuscript focuses on biological results extracted from a large body of MS1 and MS2 spectra. Method descriptions are often removed from the main text and shifted to the supplementary information, and a discussion of the pros and cons of the applied methods is often absent. MSP manuscripts with a focus on the method exist, but are very rare. Although the experimental techniques of analytical proteomics are becoming more and more sophisticated, the typical terminology in MSP creates the inverse impression by the preferred use of simple, everyday-life wording. The impression created by such streamlined presentation is that little expertise is required to generate and interpret proteomics data, that all necessary know-how is published, that all essential problems are solved, and that all necessary tools are commercially or publicly available. Discussions at mass spectrometry conferences are also affected by this impression: "You mentioned several difficulties in your proteomic analyses. Why do you not first perform a complete proteome analysis and then pick out the details you are interested in?" or "How far off are you from complete proteome analysis?" "Well, some improvements in the workflow, some more instrumental sensitivity, some speedup of the data acquisition rate, and then we are there."

In MSP a marginal analytical coverage is sufficient for an "almost complete picture of the proteome"

Bottom-up MSP uses the pars-pro-toto principle (a part stands for the whole) at two stages: (i) in the step from the MS2 spectrum of a peptide to a sequence annotation, and (ii) in the step from the annotated peptide sequence to the source-protein inference. In the first step, the pure MS2 data on average cover only one third of the peptide sequence; the rest is inferred from protein sequence database information with the help of probabilistic information. In the second step, the annotated peptides are used to infer the corresponding intact source proteins. Here, often a minimum of two peptides is regarded as the identification criterion, and the typical sequence coverage of identified proteins is approximately 10 % or below. Taken together, these numbers mean that an analytical data set with very low coverage of the proteome is sufficient to achieve what is described as an "almost complete picture of the proteome". It seems to be widely unknown that comprehensive proteome characterizations in MSP are typically built on such analytical data sets.
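Combining the two figures quoted above in a back-of-the-envelope estimate (an illustrative calculation, not a measured statistic): with roughly $10\,\%$ of a protein's sequence covered by identified peptides, of which on average only one third is directly supported by fragment ions, the fraction of a typical "identified" protein that is backed by actual MS2 evidence is of the order of

$$0.10 \times \frac{1}{3} \approx 3\,\%.$$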

Quantitative MSP is also error-prone, because it is a hybrid of MSP and standard quantification methods

Because quantitative proteomics is an extension of proteomics, it contains two types of error: the identification error and the quantification error. This means that a quantitative result can be false as a result of an erroneous identification of the peptide and/or protein, even though the quantification algorithms are applied correctly. The variance of an intensity-ratio measurement is minimal at a ratio of 1. The probability of a true difference between control and analyte increases with the extent to which the intensity ratio deviates from 1. However, the more this ratio differs from unity, the higher the probability of a low signal intensity in one of the samples (control or analyte) or in both. A low signal intensity, in turn, is associated with an increased probability of a false peptide and/or protein identification.
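The link between low signal intensity and an unreliable ratio can be made explicit with standard first-order error propagation for a ratio of two independent intensities, quoted here purely for illustration. For $r = I_A / I_B$,

$$\left(\frac{\sigma_r}{r}\right)^2 \approx \left(\frac{\sigma_{I_A}}{I_A}\right)^2 + \left(\frac{\sigma_{I_B}}{I_B}\right)^2,$$

and under counting statistics ($\sigma_I \approx \sqrt{I}$) the relative uncertainty of each intensity grows as $1/\sqrt{I}$. The relative uncertainty of the ratio is therefore dominated by whichever signal is weaker, which is why strongly regulated (far-from-unity) ratios tend to rest on the noisiest measurements.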

A large number of poorly defined MSP terms and acronyms hinders scientific discussion

Some examples are given to substantiate this thesis.



Proteomics

The meaning of this term ranges from comprehensive protein analysis (as a vision or reality) to any type of protein analysis.

Shotgun proteomics vs. bottom-up proteomics

"Shotgun proteomics" vividly describes data-dependent LC-MS/MS analysis of proteolytic peptides. It is the most widespread technique in MSP, and is now mostly named bottom-up proteomics. Possibly this change has been favored because "bottom-up" implies systematic data generation, whereas "shotgun" accentuates more the incomplete and partly accidental character of data acquisition.

Proteome coverage

Calculation of proteome coverage requires an estimate of the actual size of the proteome. The size of the encoded proteome can be used as a substitute, but this point is rarely discussed. It is often unclear whether the encoded-proteome size provides a realistic estimate of the actual proteome size under investigation.

Protein identification

Important protein details that cannot be detected by MSP include the intact molecular weight, the presence and distribution of modifications, truncations, and mutations.

Global proteomics and "proteomics and beyond"

Because proteomics in its original definition (all proteins in a cell or organism) is already comprehensive, the need for the term "global proteomics" is difficult to understand. "Proteomics and beyond" is a cryptic expression.

MSE

MS with elevated collision offset: a molecular-ion fragmentation technique without precursor-ion selection, described as "all fragment ions all the time". MSE is a pseudo-MS2 technique, which generates fragment-ion spectra in combination with only one stage of mass separation. This technique has been developed to reduce the run-to-run variability typically observed in shotgun or bottom-up proteomics.

SWATH (sequential windowed acquisition of all theoretical fragment-ion spectra)

This technique is an intermediate between MSE (fragmentation of the complete MS1 scan range) and shotgun or bottom-up proteomics (fragmentation of selected precursor ions). In SWATH, the MS1 scan range is divided into segments, which are then fragmented sequentially.

Heavy–light

This term describes pairs of labeled–unlabeled peptides. "Heavy–light" is more colorful than the scientific denotation. However, it confers on all biomolecules with natural isotopic abundance the property "light".

SILAC mouse (stable isotope labeling with amino acids in cell culture mouse)

This is a concise abbreviation at the price of an internal contradiction.

Absolute SILAC

This term describes in vivo labeling of cells, purification of a single protein, and its final absolute quantification by a conventional method of protein quantification.

Super SILAC

This term describes the use of one batch of SILAC-labeled cells as an internal standard for a set of non-labeled cells within, e.g., a kinetic experiment.

Targeted proteomics

This term describes standard molecular protein analysis without omics attributes, because it is focused on a single protein or a family of highly related proteins. The term is ill-founded, but nevertheless seems to be broadly accepted.

False discovery rate

This is the error rate. In this term the error is ennobled with the name of a discovery. An MSP result report has to contain a sufficient number of false hits to enable a meaningful estimation of the error rate. The FDR estimate depends on multiple instrumental and bioinformatics variables. An accepted procedure for its estimation is to perform a search against a nonsense-sequence database (sequence-inverted or otherwise manipulated) and to count the positive hits.
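The decoy-counting procedure mentioned in the previous entry can be reduced to a few lines. The sketch below is an assumed, minimal illustration; real FDR procedures differ in many details, and all scores and hits shown are invented.

```python
# Minimal sketch of decoy-based FDR estimation (simplified, hypothetical data).
# Each hit is (score, is_decoy); decoy hits come from a search against
# reversed or otherwise nonsensical sequences.

def decoy_fdr(hits, threshold):
    """Estimate the FDR among target hits scoring at or above `threshold`
    as (number of decoy hits) / (number of target hits)."""
    targets = sum(1 for score, is_decoy in hits if not is_decoy and score >= threshold)
    decoys = sum(1 for score, is_decoy in hits if is_decoy and score >= threshold)
    return decoys / targets if targets else 0.0

# invented search results; deliberately tiny, so the estimated rate is unrealistically high
hits = [(92, False), (88, False), (85, True), (81, False), (77, False),
        (74, True), (70, False), (66, False), (61, True), (58, False)]
print(f"estimated FDR at score >= 70: {decoy_fdr(hits, 70):.2f}")  # 2 decoys / 5 targets = 0.40
```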

Native ESI, native MS, and native ion mobility MS

The experimental basis for the use of "native" in connection with ESI-MS is that, in contrast with the standard acidic buffer of pH 2, an ammonium acetate buffer of approximately neutral pH is used. However, this buffer already differs from the original cellular environment.

In addition, accepted models of electrospray ionization of proteins make it unlikely that proteins retain their native structure during the ionization process.

Deep proteome

This refers to the detection of proteins of low abundance and evokes the positive association of "deep sequencing". Deep sequencing describes sequencing of the same DNA region 30 to 100 times in order to reduce the error rate. Nothing comparable is realized in the analysis of the "deep proteome". On the contrary, analytical LC-MS/MS data of the "deep proteome" usually have a poor signal-to-noise ratio as a result of the low abundance of the corresponding proteins.

In conclusion, MSP is an analytical technique composed of methods from physics, physical chemistry, chemistry, biology, and bioinformatics. The flood of data created by MSP can only be converted into results by bioinformatics algorithms. Great efforts are made to present MSP as a regular but somehow empowered analytical technology. This is reflected in the numerous pseudo-simple abbreviations generating allusions to everyday life, and in the avoidance of detailed method descriptions and discussions. In reality, MSP instrumental technology and algorithms have reached a level of sophistication that is understood by only a small number of experts. This development has occurred so fast that scientific communication within the MS community is severely affected by a lack of common expertise and language. The result is a kind of speechlessness with respect to method discussions in this field.

Several steps may improve this situation. Obviously, the role of bioinformatics in mass spectrometry education


has to be strengthened. In this effort, the marked differences in methods between the varieties of mass spectrometry-based omics technology must be accentuated, not leveled. Finally, the field requires an improved scientific terminology as the basis for internal and external communication. If this development can be accomplished while maintaining scientific standards, omics technology may further consolidate mass spectrometry as one of the leading analytical sciences.

Acknowledgments This manuscript was developed from the closing lecture, entitled "Mass Spectromics", of the 47th Annual Meeting of the Deutsche Gesellschaft für Massenspektrometrie, Frankfurt, March 2014.

Wolf D. Lehmann is former mass spectrometry group leader of the Core Facility Molecular Structure Analysis of the German Cancer Research Center (DKFZ) in Heidelberg, Germany. His technical focus is on nanoESI-MS/MS, nanoLC–ESI-MS/MS, and microLC–ICP-MS, and in this context his major interests include quantification of covalent protein modifications and chemical and/or biochemical production of stable isotope-labeled peptide and protein standards. Novel applications comprise phospholipid profiling and quantification, and structural and quantitative analysis of protein phosphorylation in the context of cellular signal transduction. Collaboration with biochemical and biomedical research projects has been an essential aspect of his work.

"All proteins all the time"--a comment on visions, claims, and wording in mass spectrometry-based proteomics.

"All proteins all the time"--a comment on visions, claims, and wording in mass spectrometry-based proteomics. - PDF Download Free
153KB Sizes 0 Downloads 7 Views