Proteomics 2014, 14, 2637–2646


DOI 10.1002/pmic.201400157

REVIEW

N-terminomics and proteogenomics, getting off to a good start

Erica M. Hartmann and Jean Armengaud

Biology and the Built Environment Center, Institute of Ecology and Evolution, University of Oregon, Eugene, OR, USA

Proteogenomics consists of the annotation or reannotation of protein-coding nucleic acid sequences based on the empirical observation of their gene products. While functional annotation of predicted genes is increasingly feasible given the multiplicity of genomes available for many branches of the tree of life, the accurate annotation of the translational start sites is still a point of contention. Extensive coverage of the proteome, including specifically the N-termini, is now possible, thanks to next-generation mass spectrometers able to record data from thousands of proteins at once. Efforts to increase the peptide coverage of protein sequences and to detect low abundance proteins are important to make proteomic and proteogenomic studies more comprehensive. In this review, we present the panoply of N-terminus-oriented strategies that have been developed over the last decade.

Received: April 23, 2014 Revised: April 23, 2014 Accepted: August 8, 2014

Keywords: Gene annotation / N-terminomics / Peptide enrichment / Peptide signal / Proteogenomics / Technology

Correspondence: Dr. Erica M. Hartmann, Biology and the Built Environment Center, Institute of Ecology and Evolution, University of Oregon, 5389 University of Oregon, Eugene, OR 97403, USA
E-mail: [email protected]
Fax: +1 541-346-2364

Abbreviations: COFRADIC, combined fractional diagonal chromatography; SCX, strong cation exchange; TMPP, trimethoxyphenyl phosphonium

© 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

1 The importance of N-termini

The synthesis of proteins proceeds in a stepwise fashion from the N-terminus to the C-terminus, but the N-terminus of a mature protein may be quite different from the one predicted from the genome sequence. The finished sequence of a mature protein is the result of several steps, starting with translation initiation and potentially followed by PTM or signal peptide removal, all of which are coordinated by complex interactions between ribosomes, mRNAs, tRNAs, and various enzymes that act on nascent polypeptides. Translation initiation sites are important not only because they determine the N-terminus of the nascent protein but also because they can play a major role in the regulation of protein synthesis [1]. The genetic context of the start codon helps determine how efficiently ribosomes will bind the transcript and thus influences the kinetics of translation. Although the general mechanism of translation is conserved across all organisms, which is surely proof of its evolutionary importance [2], the small differences that exist between organisms can have a significant impact on the finished form of a protein.

The process of translation is slightly more complicated in eukaryotes than in prokaryotes, but the rules for where and how translation is initiated are much stricter. In eukaryotes, once the mRNA has been synthesized and processed, notably with a 7-methylguanosine cap at the 5′ end, translation begins at the first AUG in the transcript, which corresponds to the initial methionine in the protein [1]. While exceptions to this rule are rare, they are nonetheless important. For example, translation will sometimes begin at a downstream AUG if the first start codon is not found within the consensus sequence GCCRCCAUGG, or translation can reinitiate if the first coding sequence is very short (<30 codons) [1]. In some cases, translation can initiate in the absence of a 5′ cap or at an internal ribosome entry site [3, 4]. These mechanisms are often associated with viruses [3, 4]. Furthermore, translation initiation by viruses at internal ribosome entry sites is not constrained to AUG start codons [4]. In some very rare cases, generally associated with disease states, translation begins where an expansion mutation has inserted certain repeats, a process known as

www.proteomics-journal.com


repeat-associated non-ATG translation [5]. In spite of these exceptions, the preinitiation complex of proteins that recognize start sites and begin translation exerts a great deal of control to prevent initiation at non-AUG codons [1, 3]. That these exceptions are most frequently associated with perturbations (i.e. viruses, disease states, genetic mutations) makes them all the more relevant to biomedicine.

In prokaryotes, translation usually initiates at a start codon that is approximately 7 nt downstream from a Shine–Dalgarno ribosome-binding site [1, 6]. While the AUG start codon is the most commonly observed, GUG, UUG, and sometimes AUU can also support translation, albeit still with an initial formyl-methionine residue, and the exact location, length, and sequence of the Shine–Dalgarno site vary [1, 6]. The presence of an appropriate Shine–Dalgarno site and start codon is not sufficient for translation: the secondary structure of the mRNA can block ribosome access, favoring the initiation of translation at another site [1, 6]. While instrumental, the Shine–Dalgarno site is not necessary; the ribosome can also bind purine-rich regions upstream of the coding region [6], or leaderless translation initiation can occur when the start codon is located at the 5′ end of the transcript [1, 4, 6]. Leaderless translation or translation in the absence of a Shine–Dalgarno site is especially common in Archaea [4].

From the beginning, the N-terminus of the nascent polypeptide takes on a unique importance because it often contains information regarding the eventual fate of the mature, functional protein, or whether indeed it will reach maturity. In eukaryotes, the initial methionine residue is removed in a sequence-dependent manner by an aminopeptidase concurrently with translation [7, 8]. The newly exposed N-terminal amino acid can then be acetylated (Fig. 1), which makes the nascent protein a target for degradation via the N-end rule pathway [8].
The stability of N-acetylated proteins, which comprise approximately 90% of human proteins, thus depends on steric shielding to hide the so-called “N-degron” that would otherwise condemn the nascent polypeptide to rapid degradation [8]. Acetylation is performed by multiple enzymes with different specificities [7], implying that the process acts on a variety of sequences, but not every copy of a protein is acetylated [8]. In contrast to N-terminal acetylation, N-terminal deamidation or arginylation increases the stability of the nascent polypeptide [8]. N-terminal methionine excision can similarly occur in Archaea and bacteria, when preceded by deformylation [7, 9]. The initial methionine residue in bacteria, as well as in mitochondria and chloroplasts, is typically formylated (Fig. 1) [9]. However, the inability to formylate methionine is not strictly lethal and translation can be initiated with a nonformylated methionine residue, albeit less efficiently [6, 9, 10]. The formyl group is then removed in almost all proteins concurrently with translation [9], and the initial methionine can then be excised, again dependent on the sequence of the protein [7].

The N-terminus of a protein may contain a signal peptide that governs its fate within, or outside of, the cell. These signal peptides are often, but not always, cleaved once the protein reaches its destination. For example, in eukaryotes certain sequences guide nuclear-encoded proteins to different subcellular locations, such as mitochondria or chloroplasts. The chloroplast transit peptides are cleaved once the protein has reached the chloroplast, and acetylation of the new N-terminus may then follow [11]. For some proteins, the chloroplast transit peptide is followed by another signal peptide, the luminal transit peptide, which specifies its further transport into the lumen of the chloroplast; these proteins do not undergo N-terminal acetylation because the appropriate enzymes do not exist in the lumen [11]. The exact sequences of chloroplast transit peptides vary depending on the species [11], and some peptides are dual targeting, meaning that they can signal transport to both mitochondria and chloroplasts [12]. However, sequences intermediate to the transit peptide may constitute “avoidance signals,” negating the transit peptide signal [12], so putative signal peptides may not in fact be cleaved and would thus remain intact at the N-terminus of the mature protein. In prokaryotes, N-terminal tags identify proteins destined for secretion, which are then relieved of their signal peptides by proteases located in the periplasm [13]. However, signal peptides are not always removed, as is the case for the type III secretion system, which plays an important role in interactions between bacteria and their hosts [14]. The type III secretion system transports bacterial proteins directly into the host cells, a feature that is often necessary for pathogenesis [14].

In all forms of life, the N-terminus can also contain regulatory domains that govern the activity of the mature protein, and changes to the N-terminus can alter enzymatic specificity. For example, the N-terminal arm of the regulator protein PhoP in Mycobacterium tuberculosis is species specific, as opposed to the rest of the protein, which contains highly conserved domains [15].
While the N-terminal domain is not necessary for structural stability, it affects the protein’s function as a transcriptional regulator [15]. In another example, the sequence of the N-terminus of the protease subunit ClpP is highly conserved, but changes to its length, either by truncation or by leaving part of the signal peptide intact, have an effect on the substrate specificity and degradation rate [16].

In addition to signal peptide cleavage, some proteins require proteolysis to reach maturity. For example, insulin is first synthesized as preproinsulin, a single polypeptide with an N-terminal signal peptide. It has to be cleaved twice, once to remove the signal peptide and again to form the fully mature insulin. Proteolytic activation is also required for zymogens or proenzymes, including trypsin, which is originally synthesized as trypsinogen, and hepatocyte growth factor, which is secreted as a single polypeptide chain and cleaved by the extracellular protease urokinase [17], to expose the active site of the mature enzyme. Similarly, proteinase-activated receptors, a class of transmembrane G-protein-coupled receptors, are activated by site-specific cleavage of the N-terminus, which otherwise inhibits ligand binding [18, 19]. Subsequent inactivation can then occur via cleavage of the binding domain or the tethered ligand domain, sometimes by the same protease that activates the receptor [19]. These receptors can be activated and deactivated by a variety of proteases [18, 19], implying that the specific cleavage sites may vary. The observed N-termini of these mature protease-activated proteins can thus be vastly different from those predicted from the genome sequence or even those produced during translation.

Figure 1. Common in vivo N-terminal states. Protein N-termini are frequently found as free alpha amines, formylmethionine, or acetylated. Free alpha amines and the side chains of certain residues (lysine, histidine, and arginine) are protonated at physiological pH.
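The initiation rules described in this section can be made concrete with a short sketch. The Python below is purely illustrative: the function names, the run of four or more purines used as a stand-in for a Shine–Dalgarno-like site, and the 5–10 nt spacer window are assumptions chosen for the example, not parameters of any published predictor. It flags eukaryotic AUG codons embedded in the full GCCRCCAUGG consensus and prokaryotic start codons (AUG, GUG, UUG, AUU) that lie a short distance downstream of a purine-rich stretch.

```python
import re

# Eukaryotic consensus from the text: GCCRCCAUGG, where R is a purine (A or G).
KOZAK = re.compile(r"GCC[AG]CC(AUG)G")

# Prokaryotic start codons named in the text.
PROK_STARTS = ("AUG", "GUG", "UUG", "AUU")

# Assumption: a run of >= 4 purines stands in for a Shine-Dalgarno-like site.
PURINE_RUN = re.compile(r"[AG]{4,}")


def kozak_starts(mrna):
    """Return 0-based positions of AUG codons lying in the full consensus."""
    return [m.start(1) for m in KOZAK.finditer(mrna)]


def prokaryotic_starts(mrna, min_gap=5, max_gap=10):
    """Return (position, codon, gap) for start codons preceded by a purine run.

    gap is the number of nucleotides between the end of the purine run and
    the start codon; the 5-10 nt window is an illustrative choice around the
    ~7 nt spacing quoted in the text.
    """
    hits = []
    for run in PURINE_RUN.finditer(mrna):
        for gap in range(min_gap, max_gap + 1):
            pos = run.end() + gap
            codon = mrna[pos:pos + 3]
            if codon in PROK_STARTS:
                hits.append((pos, codon, gap))
    return hits


if __name__ == "__main__":
    print(kozak_starts("CCGCCACCAUGGCU"))
    print(prokaryotic_starts("CCUAGGAGGUUUCUCCAUGAAA"))
```

A sequence-only scan of this kind ignores mRNA secondary structure, leaderless transcripts, and internal ribosome entry sites, which is one reason why the resulting candidates still need experimental confirmation.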

2 In silico prediction of protein N-termini from whole genome sequences

Armed with this knowledge of the mechanisms of translation initiation, PTM, and signal peptide cleavage, researchers have created automated algorithms to predict genes from genome sequences. These algorithms are often supplemented with machine learning seeded with empirical data. However, gene prediction algorithms are far from perfect. Even in eukaryotes, where translation initiation is fairly strict, gene calling is still problematic [20]. The problem is exacerbated for prokaryotes, where the rules for translation initiation are more flexible. Algorithms have to take into account multiple start codons, multiple ribosome-binding sequences, and the possibility that a coding sequence may not have an upstream ribosome-binding sequence. The genetic context of the start codon in prokaryotes can vary not only from species to species but also from gene to gene within a given genome [21]. For that reason, many algorithms for translation initiation site prediction in prokaryotes look for a genetic context characteristic of translation initiation sites but without specifying that context a priori. For example, a given species may appear to not use a conserved ribosome-binding sequence when in fact it employs an alternative ribosome-binding sequence that is conserved but previously went unnoticed, such as the TANNNT sequence detected by ProTISA, a translation initiation site prediction algorithm for bacteria and Archaea [21]. This algorithm was used to create an alternative to the RefSeq database whose translation initiation sites are estimated to be 10% more accurate than those of RefSeq, the accuracy of which ranges from 3 to 97% depending on the species [22]. The most common error in RefSeq is the overestimation of longest open reading frames [22], which is possibly a remnant of the eukaryotic 5′-most start codon rule.

A slew of translation initiation site prediction programs and databases exist, including ProTISA, EasyGene, GeneMarkS, Glimmer3, TiCo, and Prodigal, the accuracies of which depend on the organism in question and the GC content of the genome [22, 23].

With the advent of metagenomics, many of these programs have been extended to identify translation initiation sites from the genomes of prokaryotic communities. In addition to all of the difficulties faced by gene prediction in single-strain genomes, gene prediction in metagenomics is further complicated by the read length, which is often short, meaning that the upstream regions of coding sequences may be missing, and by the fact that the individual genomes contributing to the metagenome are often unknown [24]. The accuracy of metagenomic gene predictions is difficult to gauge given the lack of references; experimentally verified translation initiation sites are few and far between in metagenomic samples [24].

Similar to translation initiation site prediction, many algorithms for signal peptide cleavage site prediction rely on machine learning to recognize the surrounding context [25]. In general, signal peptides are characterized by the presence of certain N-terminal hydrophobic regions and neutral residues around the cleavage site [25]. However, signal peptides may bear a certain resemblance to transmembrane helices [26], and many algorithms also have trouble distinguishing signal anchors, which are not cleaved, from signal peptides, which are [13].
Nevertheless, several programs exist, including SignalP, Signal-3L, Signal-BLAST, Signal-CF, SPEPlip, Phobius, and Philius, and their accuracy is improving [13, 25]. SignalP 3.0 is reported to be the most accurate and sensitive prediction method in the absence of transmembrane helix domains, while SignalP 4.0 is better at distinguishing signal peptides from transmembrane helices [26]. However, the performance of all algorithms depends on the type of organism being studied [26], which is possibly due to the quality of the training dataset relative to the true diversity of signal peptides and cleavage sites. As for translation initiation sites, training and evaluation of signal peptide prediction algorithms are severely limited by the availability of experimentally verified data.

Historically, the N-termini of mature proteins were determined by Edman degradation [45], in which the N-terminal amino acids of a protein are identified one by one, but the current rate at which genomes are sequenced and the rate at which Edman degradation can be performed are incompatible [13]. For this reason, the use of proteogenomics, in which proteomic data are used to augment genome annotations, is appealing for the validation of both translation initiation sites and signal peptide cleavage events, among other things [13, 27]. Proteogenomics has the advantage of being high-throughput, so data from the N-termini of an entire proteome can be collected simultaneously. Multiple proteogenomic approaches exist [27], although they are not all specifically aimed at identifying N-termini. Regardless of the particular method employed, proteogenomic studies will vastly increase the number of experimentally verified translation initiation sites, N-terminal methionine excisions, and signal peptide cleavages, thereby enabling the improvement of prediction algorithms and of the automated annotation of genome sequences to come.
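To illustrate why start-site assignment is ambiguous, and how an observed N-terminal peptide can resolve it, the sketch below lists every in-frame candidate start codon between a stop codon and the nearest upstream in-frame stop, then picks the candidate whose translation begins with an experimentally observed peptide. It is a toy under stated assumptions: the example sequence and function names are hypothetical, only ATG, GTG, and TTG are treated as starts, and initiator-methionine excision is handled by simply allowing the first residue to be absent rather than by a sequence-dependent rule.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Assumption: only these three start codons are considered.
STARTS = ("ATG", "GTG", "TTG")


def translate(dna):
    """Translate up to the first stop; the first codon is always read as M."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append("M" if i == 0 else aa)
    return "".join(protein)


def candidate_starts(dna, stop_pos):
    """Return in-frame start positions upstream of the stop codon at stop_pos.

    The walk proceeds codon by codon toward the 5' end and halts at the
    first in-frame stop codon or at the start of the sequence.
    """
    hits = []
    pos = stop_pos - 3
    while pos >= 0:
        codon = dna[pos:pos + 3]
        if CODON_TABLE[codon] == "*":
            break
        if codon in STARTS:
            hits.append(pos)
        pos -= 3
    return sorted(hits)


def supported_start(dna, stop_pos, observed_peptide):
    """Return the candidate start consistent with an observed N-terminal peptide."""
    for pos in candidate_starts(dna, stop_pos):
        protein = translate(dna[pos:stop_pos])
        # Accept a match with or without the initiator methionine.
        if protein.startswith(observed_peptide) or protein[1:].startswith(observed_peptide):
            return pos
    return None


if __name__ == "__main__":
    dna = "TAGATGAAAGTGCCCATGGATCTTCGTTAA"  # hypothetical locus, stop at 27
    print(candidate_starts(dna, 27))
    print(supported_start(dna, 27, "MDLR"))
```

In this hypothetical locus a longest-open-reading-frame rule would select the most 5′ candidate, whereas the observed peptide supports a downstream start, mirroring the overestimation error described for RefSeq.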

3 N-terminomics technologies

Most strategies for the identification of the N-termini of an entire proteome, called the N-terminome, are based on bottom-up proteomics, in which the proteins are digested into their component peptides prior to detection using MS (Fig. 2). In its simplest form, N-terminomics consists of the collection of the most N-terminal peptides of all proteins identified in a proteomic study [13]. However, although standard bottom-up methods are excellent for the description and comparative analysis of proteomes, they often produce limited information about N-termini because of the sheer complexity of the sample, both in terms of the number and dynamic range of proteins, and correspondingly peptides. For example, of the 1348 proteins identified in a shotgun proteomic analysis of the bacterium Deinococcus deserti, only 136 distinct N-termini, corresponding to 112 proteins, were observed [28].

To specifically identify N-termini, dedicated high-throughput methods have thus been developed. These N-terminomics methods consist of two parts: enrichment of N-terminal peptides, which are then detected using MS, and bioinformatic data analysis to interpret the results. While advances in MS are certainly advantageous for N-terminomics, a detailed discussion of instrument development is outside the scope of this review. Rather, we will focus on the various strategies for the enrichment of N-terminal peptides and the ensuing challenges for data analysis.

Enrichment strategies are generally divided into those that use positive selection to concentrate N-terminal peptides and those that use negative selection to deplete extraneous peptides (Table 1). It is possible to take advantage of the physicochemical properties conferred by in vivo acetylation to enrich N-terminal peptides; however, not all proteins are acetylated, so many methods further derivatize peptides before and/or after protease digestion to more clearly distinguish between N-terminal and extraneous peptides (Fig. 1). Proteomes can also be divided into subcellular or extracellular fractions prior to N-terminomics analysis. Indeed, N-terminomics of exoproteomes may reveal export signal peptides or anchor sequences [29].

The enzyme trypsin cleaves polypeptide chains following lysine or arginine residues, except when followed by proline. The resulting peptides have two positive charges: one from the C-terminal basic residue and one from the free amine at the N-terminus (Fig. 1). However, N-terminally acetylated peptides, originating from N-terminally acetylated proteins, only have one positive charge because the N-terminal amine is blocked. This charge difference can be used to separate singly charged N-acetylated peptides from doubly charged internal peptides using strong cation exchange (SCX) [30]. Other enzymes, such as LysN, which cleaves before lysine residues, produce uncharged N-terminally acetylated peptides and singly charged internal peptides. To increase coverage, multiple proteases producing peptides of different lengths and charge states can be used. A dimethylation step (Fig. 3) can be introduced prior to SCX to increase the resolving power and reduce the number of internal and unmodified N-terminal peptides that co-elute with the N-acetylated fraction [31]. The dimethylation is introduced after digestion to block the free amines at the N-termini of the digested peptides as well as on any lysine residues, thus increasing the difference in basicity between N-acetylated and extraneous peptides. While this approach is straightforward, it is only valid for N-acetylated proteins.

The negative selection used in combined fractional diagonal chromatography (COFRADIC) enables the detection of both acetylated and unmodified N-terminal peptides, but the procedure is much more complicated [32]. In this method, proteins are first reduced and alkylated, followed by an acetylation step to block all free amines, both from unmodified N-termini and lysine residues. Proteins are then digested and subjected to SCX as above to remove doubly charged internal peptides. A reverse-phase HPLC step is used to fractionate the peptide mixture, containing N- and C-terminal peptides, as well as some remaining internal peptides, and new amines generated during proteolysis are then labeled with 2,4,6-trinitrobenzenesulfonic acid (Fig. 3). Thus rendered more hydrophobic, C-terminal and remaining internal peptides are removed in a second chromatography step performed using the same conditions, and N-terminal peptides, which elute at the same time as in the first chromatography step, are analyzed using MS/MS.

COFRADIC has been combined with another chemical label, namely trimethoxyphenyl phosphonium (TMPP; Fig. 3) [16]. TMPP has the advantage of not only modifying the chemical properties of N-terminal peptides, making it easier to separate them from extraneous peptides, but also enhancing their ionization and thus their detection using MS. TMPP labeling can be used without any further enrichment to identify N-terminal peptides [33, 34]. In the COFRADIC protocol, proteins are first labeled with TMPP and then reduced, alkylated, acetylated, and digested. The resulting peptides are then separated twice using reverse-phase HPLC with an intermediate 2,4,6-trinitrobenzenesulfonic acid labeling step, as in the original COFRADIC protocol.

TMPP labeling can also be used in a positive selection approach wherein TMPP-labeled N-terminal peptides are enriched using anti-TMPP monoclonal antibodies [35]. In another affinity-based approach, lysine-blocked proteins are labeled with biotin prior to digestion (Fig. 3), and biotinylated N-terminal peptides are collected using streptavidin [36]. Similar to the immunocapture of TMPP-labeled peptides, this positive selection method is only applicable to unmodified N-termini.

Dimethylation can also be used with SPE in a negative selection method to retain all N-terminal peptides, whether or not they are acetylated, such as in dimethyl isotope-coded affinity selection [37]. In this method, proteins are dimethylated prior to digestion, so all N-terminal amines are blocked, either with an in vivo modification such as acetylation or with chemical dimethylation. Enzymatic digestion then creates new, unblocked primary amines that are then covalently bound to aldehyde groups on a solid phase. The blocked N-terminal peptides are thus depleted of extraneous peptides and analyzed using MS. Two different samples can be labeled with either heavy or light dimethyl groups, enabling the quantitative comparison of N-terminomes.

Figure 2. General strategies for N-terminomics sample preparation. Whole proteomes or fractions thereof can be analyzed directly using top-down proteomics or digested and analyzed using bottom-up proteomics. Once digested, N-terminal peptides can be selectively enriched or internal and C-terminal peptides can be depleted to yield a less-complex sample.

Table 1. Strategies for the enrichment of N-terminal peptides

Selection   Label               Enrichment method   N-terminal fraction   Reference
Positive    None                SCX                 Acetylated            [30]
Positive    Dimethyl            SCX                 Acetylated b)         [31]
Positive    TMPP                Affinity capture    Unmodified            [35]
Positive    Biotin              Affinity capture    Unmodified            [36]
Positive    SATA                SPE                 Unmodified            [41]
Negative    TNBS                COFRADIC            All a)                [32]
Negative    TMPP                COFRADIC            Unmodified            [16]
Negative    Dimethyl            SPE                 All                   [37]
Negative    Dimethyl (TAILS)    Filtration          All                   [38]
Negative    iTRAQ (TAILS)       Filtration          All                   [38]
Negative    Dimethyl/phospho    Affinity capture    All c)                [39]
Negative    Dimethyl/acetyl     ChaFRADIC           All                   [40]

Quantitative?: Yes for five of the twelve methods.

a) Except His-containing peptides. b) Except N-acetyl peptides ending in Arg. c) Except phospho peptides.
ChaFRADIC, charge-based fractional diagonal chromatography; SATA, N-succinimidyl S-acetylthioacetate; TAILS, terminal amine isotopic labeling of substrates; TNBS, 2,4,6-trinitrobenzenesulfonic acid.

Figure 3. N-terminal chemical tags.
In a similar protocol, dimethylated or in vivo modified N-terminal peptides are separated from extraneous peptides with free amines by binding the latter to a polymer, which is then excluded by filtration [38]. This method, called terminal amine isotopic labeling of substrates, has also been used with iTRAQ labels, and as such, up to eight samples can be compared in a single experiment.

In another method that uses dimethylation to protect the free amines of unmodified N-termini, extraneous peptides are further differentiated by phospho tagging their alpha amines (Fig. 3) [39]. Thus derivatized, extraneous peptides are removed using TiO2 affinity. While this method provides excellent elimination of extraneous peptides, any naturally phosphorylated N-terminal peptides will also be depleted.

Charge-based fractional diagonal chromatography also uses dimethylation prior to digestion [40]. In this method, peptides are first separated into fractions by their charge state using SCX; internal and C-terminal peptides with free alpha amines produced after digestion are then trideutero-acetylated, and a second SCX step is performed on each fraction to collect N-terminal peptides. Because the alpha amines of these N-terminal peptides were blocked prior to digestion, their charge states do not change between the first and second SCX fractionations. Collecting multiple fractions circumvents the problem that histidine-containing N-terminal peptides (Fig. 1), which carry an additional positive charge, would otherwise co-fractionate with doubly charged internal peptides rather than with singly charged N-terminal peptides that lack histidine.

SPE can also be used for positive selection of N-terminal peptides. Derivatization of unmodified N-termini with N-succinimidyl S-acetylthioacetate converts free amines to thiol groups (Fig. 3) that are then covalently bound to a thiopropyl Sepharose resin [41]. To prevent reaction with cysteine-containing peptides, proteins are reduced and alkylated prior to digestion. Because these derivatizations are extremely selective, and the reaction yields are near complete, this method is compatible with quantitative studies, either with isotopic labeling or label-free methods such as spectral counting.
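The charge-based logic behind SCX enrichment can be sketched in a few lines. The example below performs a simplified in silico tryptic digestion (cleavage after K or R, except before P, as described in this section) and counts nominal positive charges: one for a free alpha amine and one for each K, R, or H side chain. The protein sequence and function names are hypothetical, missed cleavages are ignored, and real charge states depend on buffer pH and sequence context, so this is an illustration of the principle rather than a model of any published protocol.

```python
import re


def trypsin_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def nominal_charge(peptide, blocked_n_term=False):
    """Count positive charges: free alpha amine plus K, R, and H side chains."""
    side_chains = sum(peptide.count(aa) for aa in "KRH")
    return side_chains + (0 if blocked_n_term else 1)


def scx_charges(protein, n_acetylated):
    """Return (peptide, nominal charge, is_n_terminal) for each tryptic peptide."""
    rows = []
    for i, pep in enumerate(trypsin_digest(protein)):
        is_n_term = i == 0
        blocked = is_n_term and n_acetylated
        rows.append((pep, nominal_charge(pep, blocked), is_n_term))
    return rows


if __name__ == "__main__":
    protein = "ASTELDGRVLSPADKHNWEAGRAVKPDLR"  # hypothetical sequence
    for row in scx_charges(protein, n_acetylated=True):
        print(row)
```

In this toy digest the acetylated N-terminal peptide carries a single nominal charge while every internal peptide carries at least two. Replacing a residue of the N-terminal peptide with histidine raises its nominal charge to two, which illustrates why histidine-containing N-terminal peptides can be lost with the internal peptides.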

4 Current limitations to N-terminomics approaches

Regardless of the approach, the first obstacle for any N-terminomic profile is coverage. Proteomes depend on the conditions under which they were generated, e.g. the environment in which the cells were grown, the growth stage, etc. These conditions affect not only whether certain genes will be expressed but also whether the corresponding proteins will be modified, e.g. by protease activation. Complete coverage is further hindered by the dynamic range of the proteome. Even with the best techniques and instrumentation, the N-termini of low-abundance proteins may be poorly detected compared to those of the most abundant proteins [35].

The second major complication is that proteins may exist simultaneously in multiple forms. For example, even if the conditions favor activation of a given protein via a certain protease, not all instances of that protein are necessarily activated in the entire cell or population used to generate the sample. It has to be taken into account that, by chance, some individual proteins may be in the process of being synthesized and thus have not yet reached maturity, while others may be undergoing degradation. The population may be rendered even more heterogeneous by the presence of multiple proteases, or other enzymes involved in PTM, that can process the protein of interest but at different sites [13, 42]. Furthermore, even if a certain N-terminal peptide is present, it will not be observed if its physicochemical properties, e.g. molecular weight, ionization, and fragmentation behavior, are not amenable to detection using MS [39].

Some limitations can be avoided while others become apparent depending on the study design and the particular methods used. For example, using a label- and enrichment-free approach where signal peptide cleavage is detected in a standard proteomic experiment, it is impossible to identify signal peptide cleavages at sites that would also have been cut in the course of digestion with the chosen enzyme (e.g. trypsin) [13].
The addition of a chemical label helps to discriminate between N-terminal and internal peptides, which can be particularly useful for the enrichment of N-terminal peptides. However, nonspecific labeling, particularly of epsilon amines on lysine residues, and incomplete labeling of alpha amines are a constant challenge. Even when nonspecific labeling occurs at a low rate, nonspecifically labeled peptides from highly abundant proteins may vastly outnumber specifically labeled peptides from low-abundance proteins, inhibiting the detection of the latter [35]. Prelabeling degradation, e.g. during the course of protein extraction and sample preparation, can also be a major concern. The analysis of a standard mixture yielded 81% of identifications corresponding to native N-termini, implying that 19% were artifacts produced at some point during the sample preparation or analysis [36]. In addition, the presence of in vivo N-terminal modifications, while certainly of interest from a biological point of view, inhibits chemical labeling, thus preventing identification of these N-termini using positive selection methods. During the enrichment, certain artifacts can be introduced, such as N-terminal glutamate cyclization to pyroglutamate, cysteine carbamidomethylation, asparagine cyclization, and monomethyl as well as cation adducts [32, 39, 43, 44].

Some N-terminomics analyses result in an unusually high percentage of unidentified spectra, e.g. 85% of spectra that are not matched to a peptide [44]. These unmatched spectra may represent peptide sequences that are not predicted in the reference database (e.g. from unpredicted signal peptide cleavages), unaccounted-for chemical modifications (e.g. artifacts or unpredicted PTMs), or unusual fragmentation behavior due to the chemical tag used to select the peptides, as can be the case for TMPP, for which a and b ions are generally predominant over y ions in the MS/MS fragmentation spectra [33].
Indeed, the accuracy of peptide spectrum matching decreases when the peptides in consideration are modified, even if the modification is something as banal as acetylation, simply because the database search space is larger [54]. Furthermore, the fortuitous combination of modification masses can lead to false identifications, e.g. the misinterpretation of acetylation in the presence of a potassium adduct as a phosphorylation [44].

While one goal of sample preparation is to select or retain only N-terminal peptides, enrichment is often incomplete. The persistence of extraneous peptides is particularly pronounced when noncovalent enrichment methods, such as those based on electrostatic interactions, are used. For example, only 56% of peptides identified using SCX to enrich acetylated N-terminal peptides were in fact N-acetylated, whereas 98% of peptides identified after enrichment based on the covalent interaction of thiolated peptides with a Sepharose resin contained N-terminal thiols [31, 41]. The effectiveness of antibody-based enrichment depends largely on the affinity of the particular interaction, but enrichment can reach 97–99% [35]. The complete enrichment of N-terminal peptides facilitates MS analysis by removing competing signals from extraneous peptides. However, it also complicates downstream bioinformatic analysis because the majority of algorithms for the identification of proteins from MS data preferentially assign peptides to proteins from which another peptide has previously been identified, whereas by definition only one peptide per protein should be identified after successful enrichment [32, 36–38].

Finally, as the quantitation, in addition to identification, of N-terminal modifications may be biologically relevant, protocols resulting in nonquantitative yields or extensive, biased sample loss can become an issue. Many methods for the labeling and enrichment of N-terminal peptides involve several chemical transformations and extractions, thus requiring the sample to be transferred at multiple points in the protocol. Each additional step introduces another opportunity to bias the sample composition or decrease the overall yield, increasing the chances that results will not be quantitative or that lowly abundant N-termini will not be observed. Some enrichment methods have specific biases, such as the loss of histidine-containing peptides during SCX [32, 40] or phospho peptides during TiO2 affinity [39] (Table 1), so the expected properties of the N-terminome in question should be taken into account when selecting the analytical method.
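The mass coincidence mentioned above, in which acetylation combined with a potassium adduct is misread as phosphorylation, can be checked with simple arithmetic. The following is a minimal sketch only: the monoisotopic atomic masses are standard rounded values that are not given in this review, the adduct is assumed to be a potassium replacing a proton, and the function names and the 1500 Da peptide mass are hypothetical choices for illustration.

```python
# Monoisotopic atomic masses in Da (standard rounded values; an assumption,
# not taken from the review itself).
MASS = {
    "H": 1.00782503,
    "C": 12.0,
    "O": 15.99491462,
    "P": 30.97376200,
    "K": 38.96370649,
}


def delta_mass(added, removed=None):
    """Return the mass shift (Da) for a change that adds and removes atoms."""
    removed = removed or {}
    gained = sum(MASS[element] * count for element, count in added.items())
    lost = sum(MASS[element] * count for element, count in removed.items())
    return gained - lost


def ppm_error(delta, peptide_mass):
    """Express a mass difference (Da) in parts per million of a peptide mass."""
    return delta / peptide_mass * 1e6


# Acetylation: net gain of C2H2O on the N-terminal amine.
acetylation = delta_mass({"C": 2, "H": 2, "O": 1})
# Potassium adduct, assumed here to be K replacing one H.
potassium_adduct = delta_mass({"K": 1}, removed={"H": 1})
# Phosphorylation: net gain of HPO3.
phosphorylation = delta_mass({"H": 1, "P": 1, "O": 3})

combined = acetylation + potassium_adduct
difference = combined - phosphorylation

if __name__ == "__main__":
    print(f"acetylation + K adduct: {combined:.5f} Da")
    print(f"phosphorylation:        {phosphorylation:.5f} Da")
    print(f"difference:             {difference:.5f} Da")
    # 1500 Da is an arbitrary, hypothetical peptide mass.
    print(f"ppm at 1500 Da:         {ppm_error(difference, 1500.0):.3f}")
```

Under these assumptions the two mass shifts differ by well under a millidalton, so mass accuracy alone is unlikely to separate the two interpretations; fragmentation evidence or knowledge of the sample chemistry would be needed.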

5

Confirmation strategies

Several strategies exist to corroborate N-terminomics observations and to complement N-terminomics studies. One common way to substantiate the identification of N-terminal peptides is to use multiple biological and technical replicates, as well as multiple proteases to produce N-terminal fragments of different lengths, to increase the number of times a particular N-terminal peptide is identified [16, 36, 38]. Isotopically labeled tags can be used not only for quantitation but also to duplicate peptide observations [34]. Partial proteolysis can also be used to produce peptides of different lengths. However, the overlap between multiple MS runs or N-termini identified following digestion with different enzymes is limited, mainly because of the poor fragmentation of nontryptic peptides. The reproducibility of technical replicates in one study was only 70% [36]. In another study that used endoproteinase GluC and chymotrypsin in addition to trypsin for protein digestion, 66% of N-termini were only observed in one digestion, while 25% were present in two, and only 9% were present in all three [16]. The utility of this validation method is thus limited. Increasing the stringency of the identification criteria and controlling the false discovery rate increase the statistical significance of the identifications [32, 36]; however, the functional relevance of an N-terminus that is only observed once remains dubious. Agreement with in silico predictions or identification of plausible transcription and translation scenarios from the genome sequences or those of close relatives is also often reported. However, N-terminomics observations do not always fit established models [13], which is not necessarily indicative of faulty observations.

Targeted biochemical and complementary MS-based proteomic studies can also confirm some N-terminomics observations. Protein purification and sequencing via Edman degradation is one possible method, albeit only for unmodified N-termini [45]. N-terminal modifications can also be confirmed using top-down proteomics, in which the molecular weights of whole proteins are measured, and subsequent fragmentation can yield sequence information [37, 46]. Specific PTMs can be identified by incubating synthetic peptides or proteins in vitro with purified proteases or other enzymes; MALDI-TOF MS can then be used to ascertain whether the enzymes are indeed active on the putative substrates [42]. MALDI-TOF MS can also be used for the de novo sequencing of N-terminal peptides, although certain caveats regarding modification and the molecular weight of the peptide remain [47].

Molecular biology approaches can also be used to complement N-terminomics results. For example, proteogenomic data can be confirmed using reverse transcription PCR to verify the existence of corresponding mRNAs [48, 49]. This method is particularly useful if a protein proves to be longer than expected or if the gene is not found in the original genome annotation.

Other high-throughput technologies can also be used to corroborate N-terminomic observations. Especially if variations are observed regarding translation initiation or the existence of previously unannotated genes, methods targeting mRNA, such as RNA sequencing and ribosome profiling, are particularly attractive. For example, RNA sequencing can be used as supporting evidence for the translation of leaderless mRNAs [50]. Ribosome profiling, in which translation is inhibited immediately following ribosome binding to transcripts, reveals translation start sites and as such can confirm the existence of alternative translation initiation sites, open reading frames that start upstream of predicted start sites, and non-AUG start sites as well as the translation of putative noncoding RNAs [51–53].
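The overlap figures quoted earlier in this section for multi-protease digestion (N-termini seen in one, two, or all three digests) come down to a simple tally. The sketch below is illustrative only: the function names, the protein:position identifier format, and the toy observations are all hypothetical, and it is not the procedure used in the cited study.

```python
from collections import Counter


def tally_overlap(observations):
    """Count how many N-termini were seen in exactly 1, 2, ... digests.

    observations maps a digest name to the set of N-terminus identifiers
    found in that digest. Returns a Counter keyed by number of digests.
    """
    digests_per_terminus = Counter()
    for termini in observations.values():
        for terminus in set(termini):
            digests_per_terminus[terminus] += 1
    return Counter(digests_per_terminus.values())


def as_percentages(tally):
    """Convert a tally into percentages of all distinct N-termini."""
    total = sum(tally.values())
    return {n: 100.0 * count / total for n, count in sorted(tally.items())}


# Hypothetical toy data: identifiers are "protein:start position".
toy_observations = {
    "trypsin": {"P1:2", "P2:1", "P3:1", "P4:25"},
    "GluC": {"P1:2", "P2:1", "P5:1"},
    "chymotrypsin": {"P1:2", "P6:3"},
}

toy_tally = tally_overlap(toy_observations)
toy_percentages = as_percentages(toy_tally)

if __name__ == "__main__":
    for n_digests, percent in toy_percentages.items():
        print(f"seen in {n_digests} digest(s): {percent:.1f}%")
```

Keying on protein and start position rather than on the peptide sequence lets peptides of different lengths from different proteases count as the same N-terminus, which is the point of using several enzymes.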

6

Conclusion

Although a fundamental aspect of all forms of life, the synthesis and processing of proteins are far from uniform. Important differences, however subtle, in the mechanisms of translation initiation, PTM, signal peptide removal, and protease-mediated maturation render the prediction of mature protein sequence, structure, and function from genome sequences difficult to say the least. Differences in start codon usage, ribosome-binding site sequences or indeed leaderless initiation, signal peptide sequences and cleavage sites exist between species, especially within prokaryotes, and the involvement of viruses further complicates the matter in eukaryotes. Even N-terminal methionine removal, which is highly conserved, varies between organisms [54]. Current models for gene prediction, based on so few examples, are probably not perfectly applicable to all organisms. Even models that are specifically adapted to different kingdoms vary in accuracy depending on the organism [55]. As is the case for machine learning, the larger the training set, the more accurate the resulting model, so as we establish more N-terminomes for different organisms, we will increase our understanding of protein synthesis and processing and thus our ability to predict various properties of mature proteins from genome sequences. Once the set of known true N-termini has been greatly expanded, the difficulty will lie in recognizing the various patterns found in the primary and secondary structure of mRNA transcripts, as well as those in the primary and secondary structure and chemical properties of the corresponding polypeptides, and in applying the appropriate model to each organism.

This work was supported by the Commissariat à l’Énergie Atomique et aux Énergies Alternatives, and the Agence Nationale de la Recherche (ANR-12-BSV6-0012-01). EMH was supported by a Fulbright grant.

The authors have declared no conflict of interest.

7

References

[1] Kozak, M., Initiation of translation in prokaryotes and eukaryotes. Gene 1999, 234, 187–208.

[2] Morgens, D. W., The protein invasion: a broad review on the origin of the translational system. J. Mol. Evol. 2013, 77, 185–196.

[3] Sonenberg, N., Hinnebusch, A. G., Regulation of translation initiation in eukaryotes: mechanisms and biological targets. Cell 2009, 136, 731–745.

[4] Malys, N., McCarthy, J. E. G., Translation initiation: variations in the mechanism can be anticipated. Cell. Mol. Life Sci. 2011, 68, 991–1003.

[5] Cleary, J. D., Ranum, L. P. W., Repeat-associated non-ATG (RAN) translation in neurological disease. Hum. Mol. Genet. 2013, 22, R45–R51.

[6] Laursen, B. S., Sørensen, H. P., Mortensen, K. K., Sperling-Petersen, H. U., Initiation of protein synthesis in bacteria. Microbiol. Mol. Biol. Rev. 2005, 69, 101–123.

[7] Bradshaw, R. A., Brickey, W. W., Walker, K. W., N-terminal processing: the methionine aminopeptidase and N-alpha-acetyl transferase families. Trends Biochem. Sci. 1998, 23, 263–267.

[8] Kim, H. K., Kim, R. R., Oh, J. H., Cho, H. et al., The N-terminal methionine of cellular proteins as a degradation signal. Cell 2014, 156, 158–169.

[9] Solbiati, J., Chapman-Smith, A., Miller, J. L., Miller, C. G., Cronan, J. E., Jr., Processing of the N termini of nascent polypeptide chains requires deformylation prior to methionine removal. J. Mol. Biol. 1999, 290, 607–614.

[10] Li, Y., Holmes, W. B., Appling, D. R., RajBhandary, U. L., Initiation of protein synthesis in Saccharomyces cerevisiae mitochondria without formylation of the initiator tRNA. J. Bacteriol. 2000, 182, 2886–2892.

[11] Zybailov, B., Rutschow, H., Friso, G., Rudella, A. et al., Sorting signals, N-terminal modifications and abundance of the chloroplast proteome. PLoS One 2008, 3, e1994.

[12] Ge, C. R., Spanning, E., Glaser, E., Wieslander, A., Import determinants of organelle-specific and dual targeting peptides of mitochondria and chloroplasts in Arabidopsis thaliana. Mol. Plant. 2014, 7, 121–136.

[13] Ivankov, D. N., Payne, S. H., Galperin, M. Y., Bonissone, S. et al., How many signal peptides are there in bacteria? Environ. Microbiol. 2013, 15, 983–990.

[14] He, S. Y., Nomura, K., Whittam, T. S., Type III protein secretion mechanism in mammalian and plant pathogens. Biochim. Biophys. Acta-Mol. Cell Res. 2004, 1694, 181–206.

[15] Das, A. K., Kumar, V. A., Sevalkar, R. R., Bansal, R., Sarkar, D., Unique N-terminal arm of Mycobacterium tuberculosis PhoP protein plays an unusual role in its regulatory function. J. Biol. Chem. 2013, 288, 29182–29192.

[16] Bland, C., Hartmann, E. M., Christie-Oleza, J. A., Fernandez, B., Armengaud, J., N-terminal-oriented proteogenomics of the marine bacterium Roseobacter denitrificans OCh114 using TMPP labeling and diagonal chromatography. Mol. Cell. Proteomics 2014, 13, 1369–1381.

[17] Naldini, L., Tamagnone, L., Vigna, E., Sachs, M. et al., Extracellular proteolytic cleavage by urokinase is required for activation of hepatocyte growth-factor scatter factor. Embo. J. 1992, 11, 4825–4833.

[18] Macfarlane, S. R., Seatter, M. J., Kanke, T., Hunter, G. D., Plevin, R., Proteinase-activated receptors. Pharmacol. Rev. 2001, 53, 245–282.

[19] Ossovskaya, V. S., Bunnett, N. W., Protease-activated receptors: contribution to physiology and disease. Physiol. Rev. 2004, 84, 579–621.

[20] van der Burgt, A., Severing, E., Collemare, J., de Wit, P. J., Automated alignment-based curation of gene models in filamentous fungi. BMC Bioinformatics 2014, 15, 19.

[21] Hu, G. Q., Zheng, X., Yang, Y. F., Ortet, P. et al., ProTISA: a comprehensive resource for translation initiation site annotation in prokaryotic genomes. Nucleic Acids Res. 2008, 36, D114–D119.

[22] Hu, G. Q., Zheng, X., Ju, L. N., Zhu, H., She, Z. S., Computational evaluation of TIS annotation for prokaryotic genomes. BMC Bioinformatics 2008, 9, 1471–2105.

[23] Hyatt, D., Chen, G. L., Locascio, P. F., Land, M. L. et al., Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 2010, 11, 119.

[24] Liu, Y. C., Guo, J. T., Hu, G. Q., Zhu, H. Q., Gene prediction in metagenomic fragments based on the SVM algorithm. BMC Bioinformatics 2013, 14, S12.

[25] Nielsen, H., Engelbrecht, J., Brunak, S., von Heijne, G., Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 1997, 10, 1–6.

[26] Petersen, T. N., Brunak, S., von Heijne, G., Nielsen, H., SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat. Methods 2011, 8, 785–786.

[27] Armengaud, J., Hartmann, E. M., Bland, C., Proteogenomics for environmental microbiology. Proteomics 2013, 13, 2731–2742.

[28] de Groot, A., Dulermo, R., Ortet, P., Blanchard, L. et al., Alliance of proteomics and genomics to unravel the specificities of Sahara bacterium Deinococcus deserti. PLoS Genet. 2009, 5, e1000434.

[29] Armengaud, J., Christie-Oleza, J. A., Clair, G., Malard, V., Duport, C., Exoproteomics: exploring the world around biological systems. Expert Rev. Proteomics 2012, 9, 561–575.

[30] Mohammed, S., Heck, A. J. R., Strong cation exchange (SCX) based analytical methods for the targeted analysis of protein post-translational modifications. Curr. Opin. Biotechnol. 2011, 22, 9–16.

[31] Chen, S.-H., Chen, C.-R., Chen, S.-H., Li, D.-T., Hsu, J.-L., Improved Nα-acetylated peptide enrichment following dimethyl labeling and SCX. J. Proteome Res. 2013, 12, 3277–3287.

[32] Staes, A., Impens, F., Van Damme, P., Ruttens, B. et al., Selecting protein N-terminal peptides by combined fractional diagonal chromatography. Nat. Protoc. 2011, 6, 1130–1141.

[33] Baudet, M., Ortet, P., Gaillard, J. C., Fernandez, B. et al., Proteomics-based refinement of Deinococcus deserti genome annotation reveals an unwonted use of noncanonical translation initiation codons. Mol. Cell. Proteomics 2010, 9, 415–426.

[34] Bertaccini, D., Vaca, S., Carapito, C., Arsene-Ploetze, F. et al., An improved stable isotope N-terminal labeling approach with light/heavy TMPP to automate proteogenomics data validation: dN-TOP. J. Proteome Res. 2013, 12, 3063–3070.

[35] Bland, C., Bellanger, L., Armengaud, J., Magnetic immunoaffinity enrichment for selective capture and MS/MS analysis of N-terminal-TMPP-labeled peptides. J. Proteome Res. 2014, 13, 668–680.

[36] Timmer, J. C., Enoksson, M., Wildfang, E., Zhu, W. et al., Profiling constitutive proteolytic events in vivo. Biochem. J. 2007, 407, 41–48.

[37] Shen, P. T., Hsu, J. L., Chen, S. H., Dimethyl isotope-coded affinity selection for the analysis of free and blocked N-termini of proteins using LC-MS/MS. Anal. Chem. 2007, 79, 9520–9530.

[38] Kleifeld, O., Doucet, A., Prudova, A., Keller, U. A. D. et al., Identifying and quantifying proteolytic events and the natural N terminome by terminal amine isotopic labeling of substrates. Nat. Protoc. 2011, 6, 1578–1611.

[39] Mommen, G. P. M., van de Waterbeemd, B., Meiring, H. D., Kersten, G. et al., Unbiased selective isolation of protein N-terminal peptides from complex proteome samples using phospho tagging (PTAG) and TiO2-based depletion. Mol. Cell. Proteomics 2012, 11, 832–842.

[40] Venne, A. S., Vögtle, F. N., Meisinger, C., Sickmann, A., Zahedi, R. P., Novel highly sensitive, specific, and straightforward strategy for comprehensive N-terminal proteomics reveals unknown substrates of the mitochondrial peptidase Icp55. J. Proteome Res. 2013, 12, 3823–3830.

[41] Kim, J. S., Dai, Z. Y., Aryal, U. K., Moore, R. J. et al., Resin-assisted enrichment of N-terminal peptides for characterizing proteolytic processing. Anal. Chem. 2013, 85, 6826–6832.

[42] Wilson, C. H., Indarto, D., Doucet, A., Pogson, L. D. et al., Identifying natural substrates for dipeptidyl peptidases 8 and 9 using terminal amine isotopic labeling of substrates (TAILS) reveals in vivo roles in cellular homeostasis and energy metabolism. J. Biol. Chem. 2013, 288, 13936–13949.

[43] Guryča, V., Lamerz, J., Ducret, A., Cutler, P., Qualitative improvement and quantitative assessment of N-terminomics. Proteomics 2012, 12, 1207–1216.

[44] Bienvenut, W. V., Sumpton, D., Lilla, S., Martinez, A. et al., Influence of various endogenous and artefact modifications on large-scale proteomics analysis. Rapid Commun. Mass Spectrom. 2013, 27, 443–450.

[45] Link, A. J., Robison, K., Church, G. M., Comparing the predicted and observed properties of proteins encoded in the genome of Escherichia coli K-12. Electrophoresis 1997, 18, 1259–1313.

[46] Ferguson, J. T., Wenger, C. D., Metcalf, W. W., Kelleher, N. L., Top-down proteomics reveals novel protein forms expressed in Methanosarcina acetivorans. J. Am. Soc. Mass Spectrom. 2009, 20, 1743–1750.

[47] Kim, J. S., Song, J. S., Kim, Y., Park, S. B., Kim, H. J., De novo analysis of protein N-terminal sequence utilizing MALDI signal enhancing derivatization with Br signature. Anal. Bioanal. Chem. 2012, 402, 1911–1919.

[48] Christie-Oleza, J. A., Miotello, G., Armengaud, J., High-throughput proteogenomics of Ruegeria pomeroyi: seeding a better genomic annotation for the whole marine Roseobacter clade. BMC Genomics 2012, 13, 73.

[49] Pawar, H., Renuse, S., Khobragade, S. N., Chavan, S. et al., Neglected tropical diseases and omics science: proteogenomics analysis of the promastigote stage of Leishmania major parasite. OMICS 2014, 18, 499–512.

[50] de Groot, A., Roche, D., Fernandez, B., Ludanyi, M. et al., RNA sequencing and proteogenomics reveal the importance of leaderless mRNAs in the radiation-tolerant bacterium Deinococcus deserti. Genome Biol. Evol. 2014, 6, 932–948.

[51] Zhong, J., Cui, Y., Guo, J., Chen, Z. et al., Resolving chromosome-centric human proteome with translating mRNA analysis: a strategic demonstration. J. Proteome Res. 2014, 13, 50–59.

[52] Kuersten, S., Radek, A., Vogel, C., Penalva, L. O. F., Translation regulation gets its ‘omics’ moment. Wiley Interdiscip. Rev. RNA 2013, 4, 617–630.

[53] Van Damme, P., Gawron, D., Van Criekinge, W., Menschaert, G., N-terminal proteomics and ribosome profiling provide a comprehensive view of the alternative translation initiation landscape in mice and men. Mol. Cell Proteomics 2014, 13, 1245–1261.

[54] Bonissone, S., Gupta, N., Romine, M., Bradshaw, R. A., Pevzner, P. A., N-terminal protein processing: a comparative proteogenomic analysis. Mol. Cell Proteomics 2013, 12, 14–28.

[55] Martinez, A., Traverso, J. A., Valot, B., Ferro, M. et al., Extent of N-terminal modifications in cytosolic proteins from eukaryotes. Proteomics 2008, 8, 2809–2831.
