Article

J. Proteome Res., Just Accepted Manuscript • Publication Date (Web): 14 Apr 2015


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Journal of Proteome Research

A proteogenomic analysis of Trichophyton rubrum aided by RNA sequencing

Xingye Xu†,‡, Tao Liu†,‡, Xianwen Ren†,‡, Bo Liu†, Jian Yang†, Lihong Chen†, Candong Wei†, Jianhua Zheng†, Jie Dong†, Lilian Sun†, Yafang Zhu†, Qi Jin*†



MOH Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China

*(Q.J.) E-mail: [email protected]. Telephone: +86-10-67877732. Fax: +86-10-67877736.


Abstract

Infections caused by dermatophytes, Trichophyton rubrum (T. rubrum) in particular, are among the most common diseases in humans. In this study, we present a proteogenomic analysis of T. rubrum based on whole-genome proteomics and RNA-Seq studies. We confirmed 4,291 expressed proteins in T. rubrum and validated their annotated gene structures based on 35,874 supporting peptides. In addition, we identified 323 novel peptides (not present in the current annotated protein database of T. rubrum) that can be used to enhance current T. rubrum annotations. A total of 104 predicted genes supported by novel peptides were identified, and 127 gene models suggested by the novel peptides that conflicted with existing annotations were manually assigned based on transcriptomic evidence. RNA-Seq confirmed the validity of 95% of the total peptides. Our study provides evidence that confirms and improves the genome annotation of T. rubrum and represents the first survey of T. rubrum genome annotations based on experimental evidence. Additionally, our integrated proteomics and multi-sourced transcriptomics approach provides stronger evidence for annotation refinement than proteomic data alone, which helps to address the dilemma of “one-hit wonders” (uncertainties supported by only one peptide).

KEYWORDS: Proteogenomics, genome annotation, dermatophyte, Trichophyton rubrum.


1. INTRODUCTION

Dermatophytes are a group of highly specialized filamentous fungi that exclusively invade keratinized tissue.1, 2 The infections that are caused by these pathogenic fungi are among the most common diseases in humans, affecting 10–20% of the global population.3 Examples of dermatophytes include the fungi responsible for tinea pedis (athlete's foot) and tinea capitis. Although dermatophytic infections are rarely life threatening, their high incidence, prevalence, frequency of relapse, and their associated morbidity present an important public health concern.4, 5

The treatment of dermatophyte infections produces a tremendous economic burden that has been estimated at over US$500 million per year worldwide.6 T. rubrum accounts for at least 60% of all dermatophyte infections, and it is therefore the most common fungal pathogen in the world.7-9 Despite their prevalence, dermatophytes and their roles in disease are not well understood. Recently, the full genome sequences of T. rubrum and six other dermatophyte species were released and annotated.1, 10 The availability of these genomes and their corresponding annotations should help investigators to obtain insight into the molecular mechanisms of these medically important fungi. Therefore, it is very important that the genome annotations are accurate.

Large-scale genome annotation is frequently imprecise, particularly when the annotation depends heavily on computer algorithms. To annotate genomes accurately, experimental investigation is required. Proteogenomics, an emerging field at the junction of genomics and proteomics, has gained importance in recent years. The use of protein information to generate genome annotations is based on the conventional bottom-up proteomics approach and is achieved by searching annotated and customized databases to identify supporting and novel peptides that confirm and improve the current annotations.11 This approach has been used successfully for both prokaryotic and eukaryotic genome annotation, e.g., for the validation and refinement of the genome annotations of Arabidopsis,12 Caenorhabditis elegans,13 Pristionchus pacificus,14 Candida glabrata,15 and Mycobacterium tuberculosis.16

Transcriptomics is a powerful tool that has been used to facilitate genome annotation.17, 18 Hypothetical gene expression patterns that are supported by proteomic data may be further corroborated by transcriptomic evidence.19 Previously, Expressed Sequence Tags (ESTs) were the gold standard for transcript discovery.20 In recent years, the emergence of next-generation sequencing has greatly increased the availability of sequences by generating millions of reads in a relatively short time frame.21, 22 The use of high-throughput deep-sequencing technology to interpret the transcriptome, which is termed RNA-Seq, represents a promising approach for the large-scale analysis of gene expression.23

T. rubrum is considered to be a good model system for studying the pathogenic filamentous fungi that affect humans. The T. rubrum genome was sequenced using a whole-genome shotgun approach with Sanger technology. A total of 624 contigs were grouped into 36 scaffolds (supercontigs), comprising a sequenced genome of 22.53 Mb. The T. rubrum protein-coding genes were predicted with the gene prediction programs FGENESH, GENEID and GeneMark-ES, as well as by EST-based annotation.10 Version 2 of the annotated T. rubrum genome, which contains 8,725 predicted genes, was completed in September 2010. However, the current annotation of the T. rubrum genome, which is largely dependent on computational algorithms and supplemented with partial EST evidence, needs to be further improved.

In this study, we used transcriptome-assisted proteogenomics to survey the annotation of the T. rubrum genome. Our results confirmed that 49% of the annotated genes are expressed, and we validated the annotated gene structures, including translational start sites, based on 35,874 supporting peptides. In addition, we obtained 323 novel peptides that can be used to improve the genome annotation of T. rubrum. A total of 201 novel peptides were explained by a minimal set of 104 predicted genes. Moreover, 127 potential revision patterns for T. rubrum annotated genes, which were suggested by the novel peptides, were manually designated based on transcriptomic evidence. In total, 95% of the peptides were positively supported by RNA-Seq.

2. EXPERIMENTAL SECTION

2.1. Strain culture

The T. rubrum strain used in this study (BMU 01672) was obtained from the Research Center for Medical Mycology, Peking University, Beijing, China. T. rubrum was cultured on potato glucose agar at 28°C for 2-3 weeks to produce conidia and in YPD liquid medium at 28°C with constant shaking (200 rpm) to culture mycelia. Conidia were harvested at 4°C with distilled water and filtered twice through a 70-µm-pore nylon filter to remove hyphal fragments. Mycelia were harvested and washed thoroughly with distilled water to remove growth medium. The collected cells were frozen at -80°C for further experimentation.


2.2. Cell lysis and pre-fractionation of the protein samples

The conidia were thoroughly pulverized in a mortar with liquid nitrogen. The powder was incubated in denaturing buffer (7 M urea, 2 M thiourea, 2% CHAPS and 10 mM Tris-HCl (pH 8.0)) for 30 min with occasional vortexing. The supernatant and cell debris were collected separately after centrifugation at 13,000 rpm for 20 min at 4°C. The debris was considered to be the cell wall fraction, and the supernatant contained cytoplasmic and membranous proteins (labeled as ‘mixed fraction’).

The hyphae were resuspended in 10 mM Tris-HCl (pH 7.4) containing a protease inhibitor mixture (Roche) and then disrupted using a BeadBeater (Biospec Products) as previously described.24 The lysate was collected and ultracentrifuged at 150,000 × g for 3 h at 4°C. The supernatant was collected and regarded as the cytoplasmic fraction. The debris was washed twice with 10 mM Tris-HCl (pH 7.4) to remove residual cytoplasm and resuspended in denaturing extraction buffer (7 M urea, 2 M thiourea, 2% CHAPS and 10 mM Tris-HCl (pH 8.0)) with occasional vortexing. This suspension was centrifuged at 13,000 rpm for 20 min. The soluble fraction was considered to be the plasma membrane fraction, and the insoluble fraction was labeled as the cell wall fraction.

2.3. Digestion of the insoluble cell wall fraction

The insoluble cell wall fractions of the conidia and hyphae were washed four times, twice with extraction buffer and twice with 25 mM Tris-HCl (pH 8.5), and then resuspended in 25 mM Tris-HCl (pH 8.5). The proteins were reduced by incubation with 20 mM dithiothreitol (DTT) at 37°C for 1 h and alkylated by incubation with 55 mM iodoacetic acid (IAA) for 30 min at room temperature in the dark. The proteins were digested to peptides using trypsin (Promega) at a trypsin/protein ratio of 1:50 (w/w) overnight at 37°C. After the reaction was complete, the mixtures were centrifuged at 13,000 rpm for 20 min to collect the aqueous phase containing the digested peptides, which was then evaporated to dryness in a vacuum concentrator. The peptides were then re-dissolved in 0.1% formic acid (FA) prior to the LC-MS/MS analysis.

2.4. 1-D electrophoresis and in-gel digestion of the soluble fractions

The three soluble fractions (the conidial mixed fraction, the hyphal cytoplasmic fraction and the hyphal membranous fraction) were separated on 12% SDS-PAGE gels. Whole lanes were excised into 1-2-mm-thick pieces from top to bottom and then processed for in-gel enzymatic digestion as described by Arjan de Groot et al.25 The tryptic digests were dried, reconstituted in 0.1% FA, and stored at -80°C until the MS analysis.

2.5. LC-MS/MS analysis

All of the peptide mixture fractions were separated using a nanoAcquity ultraperformance LC system (Waters) equipped with a C18 reversed-phase microcapillary trapping column (nanoAcquity Symmetry C18, 5 µm, 180 µm × 20 mm) and an analytical column (nanoAcquity BEH300 C18, 1.7 µm, 100 µm × 100 mm). The flow rate was 400 nl/min with a gradient of 1-40% solvent B (A = 0.1% formic acid; B = 100% acetonitrile and 0.1% formic acid, v/v) for 120 min.


The peptides that were eluted from the column were electrosprayed directly into a high-resolution LTQ Orbitrap Velos mass spectrometer (Thermo Fisher Scientific) using a nanoelectrospray ion source. A full-range mass scan (with a mass window of 300-1500 m/z for precursor ions) was acquired at a resolution of 60,000 using the Orbitrap mass analyzer; then, 20 data-dependent MS/MS spectra were acquired using the linear ion trap. The 20 most abundant precursor ions were consecutively isolated and fragmented using collision-induced dissociation (CID) at 35% normalized collision energy. The precursors selected for MS/MS were dynamically excluded from fragmentation for 60 s. Three biological replicates were performed.

2.6. Database construction

The MS spectra were searched against three local databases. The T. rubrum annotated protein database (Anno Pro DB), version 2, was downloaded from the Dermatophyte Comparative Database at the Broad Institute (http://www.broadinstitute.org/annotation/genome/dermatophyte_comparative/MultiDownloads.html). A total of 8,725 annotated protein sequences are listed in the Anno Pro DB. The T. rubrum genome was downloaded from the same URL, and a 6-frame database (6-Frame DB) was constructed by translating the entire T. rubrum genome with the getorf program in the EMBOSS (European Molecular Biology Open Software Suite) package (v6.5.7) with the default settings. In total, 1,073,986 translated sequences were obtained. An in silico predicted database (In Silico DB) was generated using the gene-finding program AUGUSTUS (2.5.5)26 (with an ab initio prediction model and parameters that were adjusted for fungal annotations). In total, 6,241 predicted protein sequences were generated. Commonly observed contaminants were appended to each database. For Sequest searching, a decoy database was generated by reversing the peptide sequences without changing their C-termini. For Mascot searching, a decoy database was built comprising random sequences with the same average amino acid composition as the target database. The decoy databases were concatenated with the target databases to calculate the False Discovery Rate (FDR).

2.7. Database searching and peptide identification

The raw files generated by the MS/MS analysis were searched against the three local databases described above using Proteome Discoverer (version 1.3, Thermo Scientific) with the search algorithms Mascot (V 1.27, Matrix Science) and Sequest (V 1.20, Thermo Scientific). The database searches were performed with the following parameters: total intensity threshold: 100; minimum peak count: 8; S/N threshold: 2; maximum missed cleavages: 2; mass tolerance for precursor ions: 5 ppm; mass tolerance for fragment ions: 0.8 Da; static modifications: carboxymethylation of Cys (+58.005 Da); and dynamic modifications: Met oxidation (+15.995 Da) and N-terminal acetylation (+42.011 Da). A target-decoy-based FDR cut-off of < 1% was set at the peptide level. All the peptides used in the subsequent analyses were required to have unique genomic locations. Because novel peptides should have stronger evidence than known peptides,27 we set the criteria for the posterior error probability (PEP) to be PEP < 0.01 for the supporting peptides and PEP < 0.002 for the novel peptides.
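The decoy construction for Sequest (section 2.6) and the target-decoy FDR estimate (section 2.7) can be illustrated with a minimal sketch. This is not the pipeline used in the study, which relied on Proteome Discoverer with Mascot and Sequest. The function names, the simplified trypsin rule (cleave after every K or R, ignoring the proline exception), and the estimate FDR = decoy hits / target hits are all assumptions made for the example.

```python
def tryptic_peptides(protein):
    """Split a protein after every K or R (simplified trypsin rule, an assumption)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return peptides


def pseudo_reverse(protein):
    """Build a decoy by reversing each peptide while keeping its C-terminal residue in place."""
    return "".join(pep[:-1][::-1] + pep[-1] for pep in tryptic_peptides(protein))


def fdr_at_threshold(psms, score_cutoff):
    """Estimate FDR as decoy hits / target hits among matches scoring >= score_cutoff.

    psms: iterable of (score, is_decoy) pairs from a concatenated target-decoy search.
    """
    psms = list(psms)
    targets = sum(1 for score, is_decoy in psms if score >= score_cutoff and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= score_cutoff and is_decoy)
    return decoys / targets if targets else 0.0
```

In practice one would lower the score cut-off until the estimated FDR reaches the desired level (here < 1%), and then apply the PEP criteria on top of that filter.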


2.8. RNA extraction and sequencing

The total RNA of conidia and mycelia was isolated using the RNeasy Plant Mini Kit (QIAGEN), and poly(A)+ mRNA was purified using the Oligotex mRNA Mini Kit (QIAGEN) according to the manufacturer’s instructions. Single- and double-strand cDNA synthesis was conducted with the Superscript Double-Strand cDNA Synthesis Kit (Invitrogen). cDNA quality was evaluated on 1.5% agarose gels and using an Agilent 2100 Bioanalyzer. Library preparation and sequencing on the 454 GS FLX (Roche) and Illumina GAIIx (Illumina) platforms were performed as described by Ren et al.28

2.9. Data Analysis

To categorize the identified proteins, the protein sequences were compared to the Gene Ontology (GO) database using Blast2GO.29 The transmembrane domains of the proteins were predicted using TMHMM 2.0 (http://www.cbs.dtu.dk/services/TMHMM/). The presence and location of the signal peptides in the protein sequences were predicted using SignalP 4.1 (http://www.cbs.dtu.dk/services/SignalP/).

2.10. Workflow for genome annotation

The peptides identified in the Anno Pro DB, termed ‘supporting peptides’, were used to validate the existing gene models. To refine the genome annotations, the genome-translated 6-Frame DB and the In Silico DB were searched to discover novel peptides. Novel un-spliced peptides (novel peptides aligned to continuous genome sequences) were identified by comparing the peptides from the 6-Frame DB with the sequences from the Anno Pro DB to exclude identical results. The novel spliced peptides (the novel peptides that span intron(s)) were identified by comparing the peptides generated from the In Silico DB with the sequences from the Anno Pro and 6-Frame DBs to exclude identical results. The novel peptides were classified as either intragenic or intergenic peptides according to whether they overlapped with an existing gene model, including the 5’ or 3’ untranslated region (UTR: the 1,000 bases upstream or downstream of an annotated gene).

The intragenic and intergenic peptides were clustered to predict potential gene models. The peptides in each annotated gene model were clustered separately as intragenic peptide clusters. Because no accurate UTR assignments were available, peptides located within 1,000 bp upstream or downstream of a gene could either represent novel genes or be used to correct known gene models; for this reason, the peptides located in the UTRs were also included in the clustering of the intergenic peptides. A distance of 2 kb, which is longer than more than 99% of the annotated introns in T. rubrum, was selected for clustering the intergenic peptides. Each cluster, together with 10,000 bp of flanking genome sequence on both sides, was retrieved, and the AUGUSTUS gene-prediction program was used to find potential gene models. The peptides from each cluster provided hints for gene prediction, and multiple alternative transcripts were allowed per model.

To identify the peptides that were validated by the transcriptomic data, the read coverage was calculated for each peptide using in-house scripts. The read coverage for each intronic and intergenic region (according to the current annotation) was calculated, and the resulting values were ranked from highest to lowest. The 90th percentile of the ranked values was set as the background (noise) baseline for the transcriptomics data. For the positive validation of peptides based on the transcriptome, we selected a cut-off value of 5× the baseline value over the entire length of the peptide (5 times higher than the 90th percentile of the intronic/intergenic read coverage), as has been previously suggested.30 A summary of the workflow is illustrated in Figure 1.

2.11. Genome-wide mapping and visualization

The T. rubrum genome (track name: Supercont_XXX) and annotated transcripts (transcript ID: TERG_XXX) were visualized using an in-house installation of the UCSC Genome Browser. In total, 35,874 supporting peptides (peptide ID: ANPXXX) generated from the Anno Pro DB, 280 novel un-spliced peptides (peptide ID: NUSPXXX) from the 6-Frame DB, and 43 novel spliced peptides (peptide ID: NSPXXX) from the In Silico DB were mapped back to the genome using BLAT31 and shown as separate tracks. A total of 14,081,210 76-nt and 19,734,411 80-nt Illumina reads were obtained and mapped to the genome using the TopHat software, with a minimum intron size of 10. Junctions generated from the TopHat mapping of the Illumina reads (track name: Illumina) were displayed using the Genome Browser. For the 454 measurements, 317,624 reads with a median length of 442 nt were obtained. In addition, 11,085 unique ESTs from T. rubrum, based on 40,617 original reads previously produced by our group,32 were deposited in the T. rubrum Expression Database,33 which is available at http://www.mgc.ac.cn/TrED/. Both the 454 reads (track name: 454) and the EST reads (track name: ESTs) were mapped to the genome using BLAT, allowing for visualization. The depth of read coverage of the transcriptomic data was shown under the track name ‘read coverage’. The refined gene models of the current annotation (track name: Refined TERG_XXX or Novel gene model) were also displayed locally, allowing manual inspection of regions of interest.
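Two steps of the workflow in section 2.10 lend themselves to a short sketch: grouping intergenic peptides that lie within 2 kb of one another, and testing a peptide against the RNA-Seq cut-off of 5× the 90th-percentile background coverage. The in-house scripts are not described in the text, so everything below is an assumption for illustration: the function names, the representation of peptides as (start, end) intervals on a single scaffold, the nearest-rank percentile, and the reading of "over the entire length of the peptide" as requiring every covered position to meet the cut-off.

```python
import math


def cluster_peptides(intervals, max_gap=2000):
    """Group (start, end) intervals whose gap to the previous cluster is <= max_gap bases."""
    clusters = []
    for start, end in sorted(intervals):
        if clusters and start - clusters[-1]["end"] <= max_gap:
            # Close enough to the current cluster: extend it.
            clusters[-1]["end"] = max(clusters[-1]["end"], end)
            clusters[-1]["members"].append((start, end))
        else:
            clusters.append({"start": start, "end": end, "members": [(start, end)]})
    return clusters


def background_baseline(region_coverages, percentile=90):
    """Nearest-rank percentile of read coverage across intronic/intergenic regions."""
    ranked = sorted(region_coverages)
    rank = max(1, math.ceil(percentile * len(ranked) / 100))
    return ranked[rank - 1]


def supported_by_rnaseq(per_base_coverage, baseline, fold=5):
    """True if every position under the peptide has coverage >= fold * baseline."""
    return bool(per_base_coverage) and min(per_base_coverage) >= fold * baseline
```

Each resulting cluster, extended by the flanking sequence, would then be passed to the gene predictor with its member peptides as hints; that step is not sketched here because it depends on the AUGUSTUS configuration used.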

3. RESULTS

In this study, we sought to acquire comprehensive experimental validation of the genes that are expressed in T. rubrum; such evidence can, in turn, be used to improve the genome annotation of T. rubrum. In total, 4,912,411 mass spectra from 123 LC-MS/MS runs were obtained and used for database searches. Using the criteria of PEP < 0.01 for supporting peptide identification and PEP < 0.002 for novel peptide identification, a total of 35,874 supporting peptides and 323 novel peptides with unique genomic loci were identified for further analysis. The transcriptomic data from the Illumina, 454, and EST reads were used to validate the peptides and assign potential gene models based on the novel peptides (see section 2.10 and Figure 1).

3.1. Validation of the current annotation

3.1.1. Validating gene expression

To validate the expression of the predicted genes in T. rubrum, the raw data were searched against the Anno Pro DB. A total of 35,874 supporting peptides were obtained, corresponding to 4,291 proteins, which accounted for 49% of the predicted proteins in T. rubrum. When we required the presence of at least two distinct peptides per protein to confirm gene expression, 3,542 proteins, corresponding to 41% of the annotated proteome, were validated (see Suppl. Table 1 in the Supporting Information). On average, 8 peptides were identified per protein, and 30% of the identified proteins were covered by more than 10 unique peptides. Figure 2A is a plot of the peptides per protein vs. the number of proteins. A total of 843 proteins (20% of the identified proteins) had > 50% sequence coverage. Figure 2B depicts the relationship between the protein coverage and the number of proteins. In total, 3,047 proteins were identified in the conidia fraction, and 4,217 proteins were identified in the hyphae fraction. The identified proteins in each conidia and hyphae sub-fraction are shown in Supplementary Figure 1 in the Supporting Information.

The identified proteins were searched in the Gene Ontology (GO) database, and annotation information was obtained for 2,879 of the identified proteins. The GO annotations and categories of the identified proteins are listed in Supplementary Table 2 in the Supporting Information. Membrane proteins were predicted by TMHMM 2.0, and the signal peptides in the protein sequences were predicted by SignalP 4.1. A total of 631 predicted membrane proteins were obtained after excluding signal peptide interference (see Suppl. Table 3 in the Supporting Information).

3.1.2. Hypothetical protein identification

Robust characterization of hypothetical proteins requires experimental and bioinformatic approaches.34, 35 The first step in this process is to validate the expression of the hypothetical proteins in live cells, and the most effective methods to achieve this are large-scale proteomic approaches. In the current release of the T. rubrum genome, 4,825 of the 8,725 proteins are annotated as hypothetical proteins. The peptides derived from these loci were used to confirm that their corresponding genes code for proteins. In our study, we identified 1,449 hypothetical proteins, representing 30% of the proteins annotated as hypothetical proteins in the T. rubrum proteome. A total of 1,052 hypothetical proteins were identified based on at least two distinct peptides. These proteomic data provide direct experimental evidence for the expression of these proteins in T. rubrum.

3.1.3. Gene structure validation

Due to the depth of the data provided by current proteomics research, the peptides identified in our study can provide valuable evidence for validating gene structure annotations. For example, exons can be directly confirmed by the peptides that map within their boundaries, and spliced peptides can reveal the introns they span.36 A total of 9,850 exons could be confirmed based on the peptides that were identified in our study, which corresponds to 37% of the annotated exons in T. rubrum. In addition, 2,832 peptides overlapped two neighboring exons, spanning the splice junctions of 2,658 introns (see Suppl. Table 1 in the Supporting Information). These peptides not only validated the annotated introns but also confirmed their adjacent exons. For example, for the gene TERG_03414 (2,305 aa), which had the highest peptide coverage (104 unique mapped peptides), three of its four exons were confirmed by 102 un-spliced peptides, and two of its three introns were validated by two spliced peptides (see Suppl. Figure 2 in the Supporting Information).

3.1.4. Translational start site identification

Assigning the correct translational start site (TSS) is critical for genome annotation because the TSS is associated with protein function, localization, and transcriptional regulation.37 TSS validation based on traditional N-terminal sequencing often requires large quantities of the protein; moreover, the N-terminus cannot be accessed when it is blocked by modifications.38 MS-based proteomic approaches facilitate large-scale TSS confirmation by identifying the modified N-terminal peptides. N-terminal methionine excision (NME) and N-terminal acetylation (NTA) are the most common modifications of N-termini. We mapped 586 non-redundant peptides to 558 annotated proteins at either the +1 or +2 position (see Suppl. Table 4 in the Supporting Information). Of these, 470 N-terminal peptides that began at the +2 position (the residue immediately following the initiator methionine) were assumed to have undergone N-terminal methionine excision by methionine aminopeptidase (MAP).39 Consistent with other studies, our results showed that MAP is highly specific for an initiator methionine followed by A, S, P, T, G, or V, suggesting that NME is most likely to occur when the residue at the +2 position, next to the N-terminal methionine, has a small side chain (typically G, A, T, P, S, V, or C).30, 40
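The rule of thumb stated above, that methionine excision is favored when the +2 residue has a small side chain, can be written as a simple predicate. This is a hedged sketch rather than a model of MAP activity: the function name is hypothetical, and treating the residue set G, A, T, P, S, V, C as a hard yes/no rule is an assumption, since excision efficiency in reality varies from residue to residue.

```python
# Residues named in the text as favoring N-terminal methionine excision (NME).
SMALL_RESIDUES = set("GATPSVC")


def predicted_n_terminus(protein):
    """Return (mature sequence, start position) under the small-residue rule.

    The position is 1-based: 1 means the initiator Met is retained,
    2 means it is predicted to be excised.
    """
    if len(protein) >= 2 and protein[0] == "M" and protein[1] in SMALL_RESIDUES:
        return protein[1:], 2
    return protein, 1
```

A predicate like this could be used to generate the expected +1 and +2 N-terminal peptides for each annotated protein before comparing them with the observed N-terminal peptides.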

Figure 3A shows the distribution of the N-terminal amino acids that follow the NME events detected in our study, and Figure 3B shows the relative efficiency of NME for specific amino acids at the +2 position. Nearly 72% of the N-terminal peptides identified in our analysis were N-acetylated; this percentage is higher than that found in yeast.41 Of the acetylated N-terminal peptides, 18% contained an acetylated initiator methionine, and 82% were acetylated at the +2 position after methionine excision, most frequently at A (52%) or S (37%), which is consistent with previous studies.19

3.2. Improvement of the current annotation

3.2.1. Identification of novel peptides

The peptides that did not conform to annotated gene models, termed ‘novel peptides’, were used to improve the current annotation of the T. rubrum genome. The peptides that were identified based on the genome-translated 6-Frame DB were filtered against the Anno Pro DB to obtain novel un-spliced peptides. The peptides that were identified based on the In Silico DB were filtered against the Anno Pro and 6-Frame DBs to obtain novel spliced peptides. In total, we obtained 323 non-redundant novel peptides, including 280 novel un-spliced peptides and 43 novel spliced peptides (see Suppl. Tables 5 and 6 in the Supporting Information). Based on the annotation, intergenic peptides were defined as those that did not overlap existing gene models, including UTRs, while intragenic peptides conflicted with the current gene model.
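The filtering described in this section amounts to two successive exclusions. The sketch below is illustrative only: the function names are hypothetical, and matching by exact substring against in-memory lists of sequences is an assumption, as the text does not say how the comparison was implemented.

```python
def filter_novel(peptides, reference_sequences):
    """Keep peptides that are not an exact substring of any reference sequence."""
    return [p for p in peptides if not any(p in ref for ref in reference_sequences)]


def classify_novel(six_frame_hits, in_silico_hits, annotated_proteins, six_frame_orfs):
    """Return (novel un-spliced, novel spliced) peptides.

    six_frame_hits: peptides identified in the 6-Frame DB search.
    in_silico_hits: peptides identified in the In Silico DB search.
    """
    # Un-spliced: found in the 6-frame translation but absent from annotated proteins.
    unspliced = filter_novel(six_frame_hits, annotated_proteins)
    # Spliced: found only in predicted proteins, absent from both other databases.
    spliced = filter_novel(in_silico_hits, list(annotated_proteins) + list(six_frame_orfs))
    return unspliced, spliced
```

A peptide that survives the second filter cannot be read from any single continuous stretch of the genome, which is why it is treated as spanning an intron.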


To rule out the false positives associated with chemically modified forms of known peptides,27 we used BLAST to compare the 323 novel peptides against the Anno Pro DB and find sequences with high degrees of homology to known sequences. A total of 8 novel peptides (indicated in green in Suppl. Tables 5 and 6 in the Supporting Information) were found to be closely homologous to known sequences. We compared the mass shifts between these 8 novel peptides and their homologous sequences to the possible modifications listed in the UNIMOD database (http://www.unimod.org/). No possible modifications were found for the 8 peptides.

3.2.2. Gene-model prediction

All the novel peptides were clustered and submitted as hints to the AUGUSTUS gene-prediction program (see section 2.10). A total of 201 novel peptides were described by a minimal set of 104 gene models. Of these predicted gene models, 95 were refinements of known genes and 9 were considered to be novel (see Suppl. Table 7 in the Supporting Information). The protein sequences for all the predicted gene models are listed in Supplementary Table 8 in the Supporting Information. We compared the protein sequences of the 9 genes that were predicted to be novel to a non-redundant protein sequence (nr) database using BLAST, and orthologs were found for seven of the predicted genes (see Suppl. Table 9 in the Supporting Information).

3.2.3. Manual inspection of possible modifications of the annotated gene models

We manually inspected the intragenic peptides that might prompt correction of the annotated gene models and used the evidence provided by aligned RNA data (Illumina, 454 and ESTs) to assign potential gene models. A total of 127 revision patterns of existing genes were assigned
based on 138 novel peptides. The variations in the gene models based on the transcriptomic data versus the annotations could be classified into six groups: (1) different splicing sites, including different donor sites, different acceptor sites, and different donor and acceptor sites, (2) missing exons, (3) missing introns, (4) extra introns, (5) shortened exons, and (6) incorrectly split genes (see Table 1). Moreover, 11 of these gene models were considered to be alternative isoforms wherein both the revised and annotated gene models were supported by RNA evidence. Furthermore, based on the depth of read coverage displayed by the Genome Browser, 8 of the identified events may have been caused by a 1–2-bp insertion in the genome sequence that caused a frameshift, which the annotation consistently compensated for by inserting an erroneous intron. Twenty-one of the revision events implied by the novel peptides could be explained by an incomplete genome sequence in the internal area of the supercontigs (unknown genome sequences between contigs within each scaffold), which consistently resulted in a truncation of the gene model or the introduction of erroneous introns. The types of revision patterns corresponding to each peptide are listed in Supplementary Tables 5 and 6 in the Supporting Information. In addition to the revised gene models predicted by AUGUSTUS, a number of annotated genes were corrected by manual inspection. Combining AUGUSTUS prediction and manual inspection, a total of 161 gene annotations were refined, and 9 novel genes that were missing from the current annotation of the T. rubrum genome were discovered.

3.2.4. Classification of the novel peptides that suggest annotation improvements

A total of 280 novel un-spliced peptides and 43 novel spliced peptides were identified in our study. The categories of these peptides and the number of peptides in each category are listed in Tables 2 and 3.

3.2.4.1. Peptides that mapped to the intronic regions of annotated genes

In total, 29 peptides were mapped to the intronic regions of 27 annotated genes (see Suppl. Table 5A in the Supporting Information), providing evidence for the expression of these regions. Peptide NUSP0042 (GALPVGQNSPQKPPFGLYAEQLSGTAFTAPR) mapped to the first intron of TERG_00068, which suggests that a region of this intron is translated into a protein. The refined gene model supported by the three transcriptomic tracks (Illumina, ESTs and 454) suggests that the annotated exon should be extended as shown in Figure 4A. The related species Microsporum gypseum strain CBS 118893 exhibits an identical amino acid sequence in this revised region.

3.2.4.2. Peptides that mapped to exons and were translated in a different frame than the existing annotation

A total of 12 peptides were consistent with the boundaries of annotated exons but were translated in reading frames that differed from the annotated ones (see Suppl. Table 5B in the Supporting Information), suggesting that frame corrections may be warranted in the current annotations of T. rubrum. Peptide NUSP2597 (FIASVTPGIEHDFTMGV) was located in the second exon of the TERG_04104 gene (Figure 4B). A homology analysis suggested that this peptide and several
nearby amino acids are conserved in related species. This result indicates that a frame correction of the regions surrounding the peptide may be warranted. The tracks for the EST and 454 reads both indicate a missing intron upstream of NUSP2597 for the TERG_04104 gene.

3.2.4.3. Peptides that mapped to exon-intron boundaries

Fifty peptides mapped to the junctions of annotated exons and introns in a manner that suggests exon boundaries different from those in current T. rubrum annotations (see Suppl. Table 5C in the Supporting Information). Peptide NUSP2952 (DFLQEYVVNLLQSAFK) overlapped the ninth exon of the TERG_04689 gene (see Figure 4C). The refined gene model, which is supported by Illumina, EST and 454 data, displays the corrected exon-intron junction of TERG_04689. Homology evidence revealed a higher conservation of the refined TERG_04689 with orthologous proteins in Trichophyton tonsurans CBS 112818 and Arthroderma gypseum CBS 118893 compared with the annotated gene in T. rubrum, and the revised region contains an exon sequence that is identical to an exon sequence in the orthologs.

3.2.4.4. Peptides that suggested an extension of gene boundaries

A total of 12 peptides suggested an N-terminal extension of their corresponding genes, and three peptides suggested a C-terminal extension (see Suppl. Table 5D in the Supporting Information). Peptide NUSP2940 (LNIVDTPGYGDQVNNDR) spans the start site of the TERG_04676 gene in a different reading frame from that in the annotation. The three transcriptomic tracks
(Illumina, ESTs and 454) indicate a different exon-intron boundary and an extension of the N-terminus of the TERG_04676 gene, as illustrated in Figure 4D, which suggests the presence of an upstream translational start site. The start site of this annotated gene is near a >3-kb internal gap (unknown genome sequences between contigs within each scaffold) in supercontig 2.4. The RNA reads were split by this gap and mapped across it on both sides, suggesting a continuous gene model (see Figure 4E). The refined gene model “Refined TERG_04676T0” spans this gap with an intron structure that was deduced from the transcriptomic tracks. A comparative genomic analysis revealed that “Refined TERG_04676T0” shares close homology with its counterpart in Microsporum gypseum CBS 118893, which has a translational start site identical to that of the refined gene model and is similar in length (see Suppl. Figure 3 in the Supporting Information). The refined TERG_04676 gene is consistent with the external evidence, including peptide NUSP2939 (SHVGFDSITSQIEK), which is located near the start site of the predicted gene.

3.2.4.5. Peptides that were mapped to UTRs

A total of 38 peptides were mapped to the 5’ UTRs of known gene models, and 39 peptides were mapped to 3’ UTRs, indicating a possible translation of the corresponding regions (see Suppl. Tables 5E and 5F in the Supporting Information). The peptides that mapped to these regions suggest either an extension of the annotated genes or novel genes that were missed during annotation. In addition, we found peptides that were located within 100 bp of annotated gene termini, which most likely indicate extensions of gene boundaries.

For example, peptides NUSP1508 (GVIPLSTYLK) and NUSP1509 (KGVIPLSTYLK) are encoded in the 5’ UTR of the known gene TERG_02376, as illustrated in Figure 4F. The refined gene model “Refined TERG_02376T0” predicts an N-terminal extension and the presence of an additional intron in the TERG_02376 gene. The three transcriptomic tracks (Illumina, ESTs and 454), which all contain the same splice site, further corroborate the existence of this revised region. A second example involves the peptides NUSP4089 (FPDQPSDFISLLSLAQGR), NUSP4090 (ERFPDQPSDFISLLSLAQGR) and NUSP4091 (VADPLFAIVIGSSAAFIR), which are encoded in the 3’ UTR of the gene TERG_06499, supporting a novel gene model that is absent from current annotations (see Figure 4G). The three transcriptomic tracks (Illumina, ESTs and 454) show identical splicing forms in this novel gene, and close homologies are observed with related species.

3.2.4.6. Intergenic peptides

In our analysis, we identified 102 un-spliced peptides located in the intergenic regions between known gene models (see Suppl. Table 5G in the Supporting Information). The peptides that are located in intergenic regions generally indicate novel genes that were missed during annotation. Occasionally, intergenic peptides may also suggest extensions of the current gene models. Figure 4H shows an example in which 29 peptides were mapped to the intergenic region between genes TERG_00696 and TERG_00697, and four peptides were mapped to the 3’ UTR of TERG_00697. The refined gene model suggests an extension of the TERG_00697 gene, as
shown in “Refined TERG_00697T0”, which accounts for these 33 peptides. This refined gene is homologous to the 5,087-aa protein nonribosomal siderophore peptide synthase (SidC) of Arthroderma benhamiae CBS 112371.

3.2.4.7. Spliced novel peptides that implied novel splicing forms

A total of 38 of the 43 spliced peptides were supported by transcriptomic evidence showing identical spliced forms (see Suppl. Table 6 in the Supporting Information). These peptides are classified according to the changes that they imply for the current annotations, as shown in Table 3. Examples of these classes are illustrated in Figures 5A–5E. The novel spliced peptides may indicate either the need for corrections of the existing gene models or the existence of alternative splicing isoforms. Peptide NSP031 (IAHQVVDHVHFHMIPKPNEPEGLGIGWPAK) illustrates the absence of a 5-aa exon between the fourth and fifth exons of TERG_01163 (see Figure 5F). EST and 454 evidence corroborated the absence of an exon from TERG_01163, whereas the Illumina data validated the annotated splicing, which strongly suggests that these two splicing patterns co-exist in T. rubrum. A comparative genomic analysis confirmed the existence of both of these splicing models in the annotations of related species such as T. tonsurans CBS 112818 and T. equinum CBS 127.97.

3.3. Complete validation of the identified peptides by transcriptomic evidence

All three of the read sources (Illumina, 454, and ESTs) were mapped to their corresponding locations on the T. rubrum genome to validate the identified peptides (see section 2.10), and 96% and 90% of the supporting and novel peptides, respectively, were validated. In total, 95% of the peptides were validated by the transcriptome. We also examined the validation of single peptides, that is, cases in which a gene model is supported by only one peptide. We found that 90% of the single supporting peptides associated with annotated genes and 91% of the single novel peptides associated with predicted gene models were supported by transcriptomic evidence, providing strong evidence for the coding potential of these peptides. A total of 1613 peptides were not supported by the transcriptomic data, which we attribute to the complementary nature of proteomic and transcriptomic technologies. Of these 1613 peptides, 1580 are supporting peptides, and the remaining 33 are novel peptides. All the peptides that were confirmed by transcriptomics are listed in Supplementary Table 10 in the Supporting Information. The mass spectra of the single peptides and other parameters are available in Supplementary Table 11 in the Supporting Information.
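
One way to picture this validation step is as a coverage check of each peptide's genomic span against the aligned reads. The sketch below is only an assumption about what "validated" could mean (every base of the span covered by at least one read); the actual criterion is defined in section 2.10, and the function names and half-open coordinate convention are hypothetical. A spliced peptide would need this check applied to each exon block separately.

```python
def is_covered(peptide_span, read_spans):
    """True if every base of the half-open span [start, end) lies under some read."""
    start, end = peptide_span
    position = start  # leftmost base not yet known to be covered
    for read_start, read_end in sorted(read_spans):
        if read_start > position:
            break  # a gap begins at `position`, so coverage is incomplete
        position = max(position, read_end)
        if position >= end:
            return True
    return position >= end


def validated_fraction(peptide_spans, read_spans):
    """Fraction of peptide spans that are fully covered by the reads."""
    if not peptide_spans:
        return 0.0
    covered = sum(is_covered(span, read_spans) for span in peptide_spans)
    return covered / len(peptide_spans)
```

As a consistency check on the reported figures, and assuming the 323 novel peptides from section 3.2.1 form the denominator, (323 - 33) / 323 is approximately 89.8%, which agrees with the stated 90% for novel peptides.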

4. DISCUSSION

T. rubrum has been studied for many years as a model organism for pathogenic filamentous fungi in humans. Previous attempts to use genetic tools to decipher the pathogenesis of this fungus have often been hindered by the lack of a sequenced genome.42 Fortunately, the annotated genome of T. rubrum was recently released. Based on this important resource, we performed an RNA-Seq–assisted proteogenomic analysis that represents the first systematic investigation of the annotation of the T. rubrum genome. This study provides solid evidence for the expression of
annotated T. rubrum genes at the protein level, and it also permitted us to make improvements to the current annotations. To achieve maximum coverage of the expressed proteins, the proteins were separately extracted from T. rubrum at two stages of its life cycle. These proteins were further divided into fractions according to their cellular components, and LTQ-Orbitrap Velos mass spectrometry was used to generate accurate proteomic data. The MS analysis was performed at high resolution, whereas the MS/MS analysis was performed at low resolution; therefore, both a high precursor ion mass accuracy and a high scan speed were achieved simultaneously.

To identify the novel peptides in our study, we used two customized databases: the 6-Frame DB and the In Silico DB. The 6-Frame DB contains all the possible un-spliced sequences but fails to capture exon-exon junction peptides. To overcome this limitation of the 6-Frame DB, we constructed an In Silico DB with AUGUSTUS, which was not used for genome annotation by the Broad Institute. Thus, the In Silico DB is expected to contain gene models that are absent from the current T. rubrum annotations. By subtracting the sequences that are identical to those in the Anno Pro and 6-Frame DBs, novel spliced peptides could be detected in the In Silico DB via spectra searching. In this manner, the two customized DBs were used together to identify novel peptides.

Our approach to annotation improvement involved the integration of proteomics and multiple sources of transcriptomic data. Unlike the transcriptomic data, which contain sequences that might not be translated into proteins, the peptides that were identified in our experiments represent direct evidence for gene expression. For this reason, the proteomic data were initially
used to improve the current annotation. Additionally, the proteomic data were used as primary evidence due to their unique ability to correctly assign gene boundaries, identify actual open reading frames, confirm the translation of "doubtful" genes, discover post-translational modifications and assign translational start sites.30 Due to their relatively long sequences and high genome coverage, the transcriptomic data strengthened the ability of the proteomic data to validate and correct gene annotations. Our procedure is particularly suitable for eukaryotic genome annotation because large numbers of spliced RNA reads can clearly display the exon–intron boundaries of actual gene structures.20 As illustrated in Figure 4, the novel peptides that we discovered implied patterns that conflicted with the annotation. Without transcriptomic data, as is the case in many proteogenomics studies, the gene models implied by these novel peptides could only be predicted by computational algorithms. In our study, three sources of transcriptomic evidence were used to reveal the splicing forms of the actual gene models and thereby confirm the predicted genes. In the cases where no predicted gene model accounted for the novel peptides, the mapped transcriptomic data provided valuable evidence to assign the correct gene models. This feature of transcriptomic data enables a visual inspection of the real gene models of the corresponding regions suggested by the novel peptides. This method is particularly helpful when the mapped reads from different sources (Illumina, 454 and ESTs) show identical gene structures, which increases the reliability of the gene models compared with those based on a single transcriptomic source. In addition, due to the high depth that can be achieved by RNA-Seq, it is also possible to explore alternative models that
support both annotated models and modified gene models that are inferred from RNA evidence (as in the example shown in Figure 5F). It is widely believed that gene models that are supported by multiple peptides are more reliable than those supported by single peptides. In our study, we visually inspected a large number of novel single peptides together with aligned transcriptomic data to suggest annotation refinements, thereby increasing the confidence of the refinements inferred from single peptides. As shown in Figure 4A, the discovery of the single peptide NUSP0042 suggested that an intron had been erroneously annotated, and the refined gene revealed a corrected gene model with an extended second exon. The three transcriptomic tracks (Illumina, ESTs and 454) show identical patterns for this refined gene, which was implied only by a single peptide, thus increasing its reliability. Therefore, we believe that incorporating transcriptomic data may help to resolve “one-hit wonders”. In our analysis, 90% of the single supporting peptides that corresponded to annotated genes and 91% of the single novel peptides that supported predicted gene models were positively validated by RNA evidence, thereby increasing the validity of the single peptides. Our approach involved directly matching proteomics data to genome-derived databases and using transcriptomic data to validate peptides. Other options also exist; for example, an RNA-Seq-enhanced database could be used for peptide matching.43, 44 This experimentally derived database contains relatively few artificially created non-existing protein sequences, which increases search specificity. However, false negatives may be introduced by the RNA-Seq
database due to insufficient sequencing depth, transcript assembly errors and/or the filtering out of transcripts with low transcriptional levels. Both approaches identify proteins using orthogonal data, namely, transcriptional and translational data. Integrated multi-omics analyses for genome annotation provide stronger evidence than single-omics approaches. The genome annotations of T. rubrum and six related dermatophyte species, which were obtained as a consequence of recent advances in sequencing technology, have led to great progress in dermatophyte research. The findings described herein may be helpful for improving the genome annotations of dermatophytes that are closely related to T. rubrum. The accurate annotation of these fungal genomes paves the way for subsequent molecular studies of host colonization, invasion, and specialization.42 Together, these avenues will contribute to our understanding of the mechanisms of infection of these clinically important pathogenic fungi, which will lead to the development of improved therapeutics.6 Due to the increased attention that these fungi have received, great progress has already been achieved, and we expect dermatophyte research to advance rapidly in the near future.

ASSOCIATED CONTENT

Supporting Information. This material is available free of charge via http://pubs.acs.org/

Supplementary Table 1. Complete list of proteins identified from Trichophyton rubrum and the corresponding supporting peptides.
Supplementary Table 2. GO annotation and category of the identified proteins.
Supplementary Table 3. Proteins with possible transmembrane helices.
Supplementary Table 4. The list of N-terminal peptides mapped to annotated proteins at either the +1 or +2 position.
Supplementary Table 5. Classification of novel un-spliced peptides.
Supplementary Table 6. Novel spliced peptides represent different splicing forms of annotated genes.
Supplementary Table 7. The list of predicted transcripts supported by novel peptides.
Supplementary Table 8. Protein sequences of the predicted transcripts.
Supplementary Table 9. Orthologs for the predicted novel genes.
Supplementary Table 10. Supporting and novel peptides confirmed by transcriptomic evidence.
Supplementary Table 11. Spectra of novel and supporting single peptides.
Supplementary Figure 1. The numbers of identified proteins in each sub-fraction of conidia and hyphae.
Supplementary Figure 2. Validation of the structure of the TERG_03414 gene.
Supplementary Figure 3. Homology analysis of the refined gene “Refined TERG_04676T0”.

Data availability

The complete set of mass spectrometry data (mgf file format) generated from our study has been deposited in the publicly accessible database PeptideAtlas and is available with dataset identifier PASS00328. The raw RNA-Seq data were deposited in the NCBI Sequence Read Archive (SRA) with accession number SRA050365.1. The improved annotation of T. rubrum is available at the Broad Institute download page, with the placeholder of "Updated T. rubrum annotation by Xu et al", under the 'Supplementary
Download' section: http://www.broadinstitute.org/annotation/genome/dermatophyte_comparative/MultiDownloads.html.

AUTHOR INFORMATION

Corresponding Author

*(Q.J.) E-mail: [email protected]. Phone: +86-10-67877732. Fax: +86-10-67877736. Address: MOH Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College, No. 6, Rongjing East Street, BDA, Beijing, China.

Author Contributions

‡These authors contributed equally to this work. The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

We thank the Broad Institute for making the improved annotation available at their download page. We thank Ruoyu Li for providing the T. rubrum BMU 01672 strain. We also thank Lingling Wang and coworkers for the production of the EST sequences.

This work was supported by the National Natural Science Foundation of China (Grant No. 30870104), the National High Technology Research and Development Program of China (Grant No. 2012AA020303), and the National Science and Technology Major Project of China (Grant No. 2013ZX10004-101).

ABBREVIATIONS

ESTs, expressed sequence tags; FDR, false discovery rate; MAP, methionine aminopeptidase; MS/MS, tandem mass spectrometry; NME, N-terminal methionine excision; NTA, N-terminal acetylation; RNA-Seq, ribonucleic acid sequencing; TSS, translational start site; T. rubrum, Trichophyton rubrum; UTR, untranslated region.

REFERENCES

(1) Burmester, A.; Shelest, E.; Glockner, G.; Heddergott, C.; Schindler, S.; Staib, P.; Heidel, A.; Felder, M.; Petzold, A.; Szafranski, K.; Feuermann, M.; Pedruzzi, I.; Priebe, S.; Groth, M.; Winkler, R.; Li, W.; Kniemeyer, O.; Schroeckh, V.; Hertweck, C.; Hube, B.; White, T. C.; Platzer, M.; Guthke, R.; Heitman, J.; Wostemeyer, J.; Zipfel, P. F.; Monod, M.; Brakhage, A. A. Comparative and functional genomics provide insights into the pathogenicity of dermatophytic fungi. Genome Biol. 2011, 12 (1), R7.
(2) Weitzman, I.; Summerbell, R. C. The dermatophytes. Clin. Microbiol. Rev. 1995, 8 (2), 240-259.
(3) Grumbt, M.; Monod, M.; Staib, P. Genetic advances in dermatophytes. FEMS Microbiol. Lett. 2011, 320 (2), 79-86.
(4) Havlickova, B.; Czaika, V. A.; Friedrich, M. Epidemiological trends in skin mycoses worldwide. Mycoses 2008, 51 Suppl 4, 2-15.
(5) Smijs, T. G.; Bouwstra, J. A.; Talebi, M.; Pavel, S. Investigation of conditions involved in the susceptibility of the dermatophyte Trichophyton rubrum to photodynamic treatment. J. Antimicrob. Chemother. 2007, 60 (4), 750-759.
(6) Achterman, R. R.; White, T. C. A foot in the door for dermatophyte research. PLoS Pathog. 2012, 8 (3), e1002564.
(7) Coloe, S. V.; Baird, R. W. Dermatophyte infections in Melbourne: trends from 1961/64 to 1995/96. Pathology 1999, 31 (4), 395-397.
(8) Liu, T.; Zhang, Q.; Wang, L.; Yu, L.; Leng, W.; Yang, J.; Chen, L.; Peng, J.; Ma, L.; Dong, J.; Xu, X.; Xue, Y.; Zhu, Y.; Zhang, W.; Yang, L.; Li, W.; Sun, L.; Wan, Z.; Ding, G.; Yu, F.; Tu, K.; Qian, Z.; Li, R.; Shen, Y.; Li, Y.; Jin, Q. The use of global transcriptional analysis to reveal the biological and cellular events involved in distinct development phases of Trichophyton rubrum conidial germination. BMC Genomics 2007, 8, 100.
(9) Mukherjee, P. K.; Leidich, S. D.; Isham, N.; Leitner, I.; Ryder, N. S.; Ghannoum, M. A. Clinical Trichophyton rubrum strain exhibiting primary resistance to terbinafine. Antimicrob. Agents Chemother. 2003, 47 (1), 82-86.
(10) Martinez, D. A.; Oliver, B. G.; Graser, Y.; Goldberg, J. M.; Li, W.; Martinez-Rossi, N. M.; Monod, M.; Shelest, E.; Barton, R. C.; Birch, E.; Brakhage, A. A.; Chen, Z.; Gurr, S. J.; Heiman, D.; Heitman, J.; Kosti, I.; Rossi, A.; Saif, S.; Samalova, M.; Saunders, C. W.; Shea, T.; Summerbell, R. C.; Xu, J.; Young, S.; Zeng, Q.; Birren, B. W.; Cuomo, C. A.; White, T. C. Comparative genome analysis of Trichophyton rubrum and related dermatophytes reveals candidate genes involved in infection. MBio 2012, 3 (5), e00259-12.
(11) Xing, X. B.; Li, Q. R.; Sun, H.; Fu, X.; Zhan, F.; Huang, X.; Li, J.; Chen, C. L.; Shyr, Y.; Zeng, R.; Li, Y.
X.; Xie, L. The discovery of novel protein-coding features in mouse genome based on mass spectrometry data. Genomics 2011, 98 (5), 343-351.
(12) Castellana, N. E.; Payne, S. H.; Shen, Z.; Stanke, M.; Bafna, V.; Briggs, S. P. Discovery and revision of Arabidopsis genes by proteogenomics. Proc. Natl. Acad. Sci. U. S. A. 2008, 105 (52), 21034-21038.
(13) Merrihew, G. E.; Davis, C.; Ewing, B.; Williams, G.; Kall, L.; Frewen, B. E.; Noble, W. S.; Green, P.; Thomas, J. H.; MacCoss, M. J. Use of shotgun proteomics for the identification, confirmation, and correction of C. elegans gene annotations. Genome Res. 2008, 18 (10), 1660-1669.
(14) Borchert, N.; Dieterich, C.; Krug, K.; Schutz, W.; Jung, S.; Nordheim, A.; Sommer, R. J.; Macek, B. Proteogenomics of Pristionchus pacificus reveals distinct proteome structure of nematode models. Genome Res. 2010, 20 (6), 837-846.
(15) Prasad, T. S.; Harsha, H. C.; Keerthikumar, S.; Sekhar, N. R.; Selvan, L. D.; Kumar, P.; Pinto, S. M.; Muthusamy, B.; Subbannayya, Y.; Renuse, S.; Chaerkady, R.; Mathur, P. P.; Ravikumar, R.; Pandey, A. Proteogenomic analysis of Candida glabrata using high resolution mass spectrometry. J. Proteome Res. 2012, 11 (1), 247-260.
(16) Kelkar, D. S.; Kumar, D.; Kumar, P.; Balakrishnan, L.; Muthusamy, B.; Yadav, A. K.; Shrivastava, P.; Marimuthu, A.; Anand, S.; Sundaram, H.; Kingsbury, R.; Harsha, H. C.; Nair, B.; Prasad, T. S.; Chauhan, D. S.; Katoch, K.; Katoch, V. M.; Kumar, P.; Chaerkady, R.; Ramachandran, S.; Dash, D.; Pandey, A. Proteogenomic analysis of Mycobacterium tuberculosis by high resolution mass spectrometry. Mol. Cell. Proteomics 2011, 10 (12), M111.011627.
(17) Seok, J.; Xu, W.; Jiang, H.; Davis, R. W.; Xiao, W. Knowledge-based reconstruction of mRNA transcripts with short sequencing reads for transcriptome research. PLoS One 2012, 7 (2), e31440.
(18) Li, Z.; Zhang, Z.; Yan, P.; Huang, S.; Fei, Z.; Lin, K. RNA-Seq improves annotation of protein-coding genes in the cucumber genome. BMC Genomics 2011, 12, 540.
(19) Chaerkady, R.; Kelkar, D. S.; Muthusamy, B.; Kandasamy, K.; Dwivedi, S. B.; Sahasrabuddhe, N. A.; Kim, M. S.; Renuse, S.; Pinto, S. M.; Sharma, R.; Pawar, H.; Sekhar, N. R.; Mohanty, A. K.; Getnet, D.; Yang, Y.; Zhong, J.; Dash, A. P.; MacCallum, R. M.; Delanghe, B.; Mlambo, G.; Kumar, A.; Keshava Prasad, T. S.; Okulate, M.; Kumar, N.; Pandey, A. A proteogenomic analysis of Anopheles gambiae using high-resolution Fourier transform mass spectrometry. Genome Res. 2011, 21 (11), 1872-1881.
(20) Denoeud, F.; Aury, J. M.; Da Silva, C.; Noel, B.; Rogier, O.; Delledonne, M.; Morgante, M.; Valle, G.; Wincker, P.; Scarpelli, C.; Jaillon, O.; Artiguenave, F. Annotating genomes with massive-scale RNA sequencing. Genome Biol. 2008, 9 (12), R175.
(21) Desgagne-Penix, I.; Khan, M. F.; Schriemer, D. C.; Cram, D.; Nowak, J.; Facchini, P. J. Integration of deep transcriptome and proteome analyses reveals the components of alkaloid metabolism in opium poppy cell cultures. BMC Plant Biol. 2010, 10, 252.
(22) Ning, K.; Fermin, D.; Nesvizhskii, A. I. Comparative analysis of different label-free mass spectrometry based protein abundance estimates and their correlation with RNA-Seq gene expression data. J. Proteome Res. 2012, 11 (4), 2261-2271.
(23) Wang, Z.; Gerstein, M.; Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 2009, 10 (1), 57-63.
(24) Lee, S. C.; Fang, T. J. Simultaneous extraction of carotenoids and transfructosylating enzyme from Xanthophyllomyces dendrorhous by a bead beater. Biotechnol. Lett. 2011, 33 (1), 109-112.
(25) de Groot, A.; Dulermo, R.; Ortet, P.; Blanchard, L.; Guerin, P.; Fernandez, B.; Vacherie, B.; Dossat, C.;
Jolivet, E.; Siguier, P.; Chandler, M.; Barakat, M.; Dedieu, A.; Barbe, V.; Heulin, T.; Sommer, S.; Achouak, W.; Armengaud, J. Alliance of proteomics and genomics to unravel the specificities of Sahara bacterium Deinococcus deserti. PLoS Genet. 2009, 5 (3), e1000434. (26) Stanke, M.; Steinkamp, R.; Waack, S.; Morgenstern, B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucl. Acids Res. 2004, 32 (Web Server issue), W309-312. (27) Nesvizhskii, A. I. Proteogenomics: concepts, applications and computational strategies. Nat. Methods 2014, 11 (11), 1114-1125. (28) Ren, X.; Liu, T.; Dong, J.; Sun, L.; Yang, J.; Zhu, Y.; Jin, Q. Evaluating de Bruijn graph assemblers on 454 transcriptomic data. PLoS One 2012, 7 (12), e51188. (29) Conesa, A.; Gotz, S.; Garcia-Gomez, J. M.; Terol, J.; Talon, M.; Robles, M. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 2005, 21 (18), 3674-3676. (30) Volkening, J. D.; Bailey, D. J.; Rose, C. M.; Grimsrud, P. A.; Howes-Podoll, M.; Venkateshwaran, M.; Westphall, M. S.; Ane, J. M.; Coon, J. J.; Sussman, M. R. A proteogenomic survey of the Medicago truncatula genome. Mol. Cell. Proteomics 2012, 11 (10), 933-944. (31) Kent, W. J. BLAT--the BLAST-like alignment tool. Genome Res. 2002, 12 (4), 656-664. (32) Wang, L.; Ma, L.; Leng, W.; Liu, T.; Yu, L.; Yang, J.; Yang, L.; Zhang, W.; Zhang, Q.; Dong, J.; Xue, Y.; Zhu, Y.; Xu, X.; Wan, Z.; Ding, G.; Yu, F.; Tu, K.; Li, Y.; Li, R.; Shen, Y.; Jin, Q. Analysis of the dermatophyte Trichophyton rubrum expressed sequence tags. BMC Genomics 2006, 7, 255. (33) Yang, J.; Chen, L.; Wang, L.; Zhang, W.; Liu, T.; Jin, Q. TrED: the Trichophyton rubrum Expression Database. BMC Genomics 2007, 8, 250. 36

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 38 of 50

(34) Kolker, E.; Picone, A. F.; Galperin, M. Y.; Romine, M. F.; Higdon, R.; Makarova, K. S.; Kolker, N.; Anderson, G. A.; Qiu, X.; Auberry, K. J.; Babnigg, G.; Beliaev, A. S.; Edlefsen, P.; Elias, D. A.; Gorby, Y. A.; Holzman, T.; Klappenbach, J. A.; Konstantinidis, K. T.; Land, M. L.; Lipton, M. S.; McCue, L. A.; Monroe, M.; Pasa-Tolic, L.; Pinchuk, G.; Purvine, S.; Serres, M. H.; Tsapin, S.; Zakrajsek, B. A.; Zhu, W.; Zhou, J.; Larimer, F. W.; Lawrence, C. E.; Riley, M.; Collart, F. R.; Yates, J. R., 3rd; Smith, R. D.; Giometti, C. S.; Nealson, K. H.; Fredrickson, J. K.; Tiedje, J. M. Global profiling of Shewanella oneidensis MR-1: expression of hypothetical genes and improved functional annotations. Proc. Natl. Acad. Sci. U. S. A. 2005, 102 (6), 2099-2104. (35) Lubec, G.; Afjehi-Sadat, L.; Yang, J. W.; John, J. P. Searching for hypothetical proteins: theory and practice based upon original data and literature. Prog. Neurobiol. 2005, 77 (1-2), 90-127. (36) Brosch, M.; Saunders, G. I.; Frankish, A.; Collins, M. O.; Yu, L.; Wright, J.; Verstraten, R.; Adams, D. J.; Harrow, J.; Choudhary, J. S.; Hubbard, T. Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and "resurrected" pseudogenes in the mouse genome. Genome Res. 2011, 21 (5), 756-767. (37) Rison, S. C.; Mattow, J.; Jungblut, P. R.; Stoker, N. G. Experimental determination of translational starts using peptide mass mapping and tandem mass spectrometry within the proteome of Mycobacterium tuberculosis. Microbiology 2007, 153 (Pt 2), 521-528. (38) Zheng, J.; Ren, X.; Wei, C.; Yang, J.; Hu, Y.; Liu, L.; Xu, X.; Wang, J.; Jin, Q. Analysis of the Secretome and Identification of Novel Constituents From Culture Filtrate of Bacillus Calmette-Guerin Using High-resolution Mass Spectrometry. Mol. Cell. Proteomics 2013, 12 (12), 3987-3988. (39) Gallien, S.; Perrodou, E.; Carapito, C.; Deshayes, C.; Reyrat, J. M.; Van Dorsselaer, A.; Poch, O.; Schaeffer, C.; Lecompte, O. 
Ortho-proteogenomics: multiple proteomes investigation through orthology and a 37

ACS Paragon Plus Environment

Page 39 of 50

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

new MS-based protocol. Genome Res. 2009, 19 (1), 128-135. (40) Frottin, F.; Martinez, A.; Peynot, P.; Mitra, S.; Holz, R. C.; Giglione, C.; Meinnel, T. The proteomics of N-terminal methionine cleavage. Mol. Cell. Proteomics 2006, 5 (12), 2336-2349. (41) Arnesen, T.; Van Damme, P.; Polevoda, B.; Helsens, K.; Evjenth, R.; Colaert, N.; Varhaug, J. E.; Vandekerckhove, J.; Lillehaug, J. R.; Sherman, F.; Gevaert, K. Proteomics analyses reveal the evolutionary conservation and divergence of N-terminal acetyltransferases from yeast and humans. Proc. Natl. Acad. Sci. U. S. A. 2009, 106 (20), 8157-8162. (42) Rivera, Z. S.; Losada, L.; Nierman, W. C. Back to the future for dermatophyte genomics. MBio 2012, 3 (6), e00381-12. (43) Kelkar, D. S.; Provost, E.; Chaerkady, R.; Muthusamy, B.; Manda, S. S.; Subbannayya, T.; Selvan, L. D.; Wang, C. H.; Datta, K. K.; Woo, S.; Dwivedi, S. B.; Renuse, S.; Getnet, D.; Huang, T. C.; Kim, M. S.; Pinto, S. M.; Mitchell, C. J.; Madugundu, A. K.; Kumar, P.; Sharma, J.; Advani, J.; Dey, G.; Balakrishnan, L.; Syed, N.; Nanjappa, V.; Subbannayya, Y.; Goel, R.; Prasad, T. S.; Bafna, V.; Sirdeshmukh, R.; Gowda, H.; Wang, C.; Leach, S. D.; Pandey, A. Annotation of the zebrafish genome through an integrated transcriptomic and proteomic analysis. Mol. Cell. Proteomics 2014, 13 (11), 3184-3198. (44) Wu, P.; Zhang, H.; Lin, W.; Hao, Y.; Ren, L.; Zhang, C.; Li, N.; Wei, H.; Jiang, Y.; He, F. Discovery of novel genes and gene isoforms by integrating transcriptomic and proteomic profiling from mouse liver. J. Proteome Res. 2014, 13 (5), 2409-2419.


Figure legends

Figure 1. Workflow employed in our analysis. A total of 4,912,411 mass spectra were searched against an Anno Pro DB, a 6-frame DB, and an In Silico DB. Supporting peptides generated from the Anno Pro DB were used to validate annotated genes. To identify novel un-spliced peptides, peptides generated from the 6-frame DB were compared against the Anno Pro DB to exclude identical results. To identify novel spliced peptides, peptides identified from the In Silico DB were compared against the Anno Pro and 6-frame DBs. Intragenic and intergenic peptides were clustered separately for gene-model prediction. Transcriptomic data (Illumina, 454, and EST reads) were used to assign possible revision models suggested by the novel peptides and to validate peptide identities.

Figure 2. Coverage of proteins identified in our analysis. (A) Number of peptides per protein, plotted against the number of proteins. (B) Percentage of protein sequence covered by identified peptides, plotted against the number of proteins.
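The peptide filtering described in the Figure 1 legend can be read as set differences between the three search results. The following is a minimal Python sketch under that reading, assuming each search result has already been reduced to a set of peptide sequences and that "identical results" means exact sequence matches. The function name and the toy sequences are hypothetical and are not taken from the study.

```python
def classify_novel_peptides(anno_pro, six_frame, in_silico):
    """Split search results into novel un-spliced and novel spliced peptides.

    Each argument is a set of peptide sequences from one database search
    (hypothetical inputs; a real pipeline would also carry scores and FDR).
    """
    # Un-spliced candidates: found in the 6-frame search but not already
    # explained by the annotated protein database.
    novel_unspliced = six_frame - anno_pro
    # Spliced candidates: found in the in silico search but absent from
    # both the annotated and the 6-frame results.
    novel_spliced = in_silico - anno_pro - six_frame
    return novel_unspliced, novel_spliced


# Toy sequences for illustration only.
anno = {"PEPTIDEA", "PEPTIDEB"}
six = {"PEPTIDEA", "PEPTIDEC"}
silico = {"PEPTIDEB", "PEPTIDEC", "PEPTIDED"}
unspliced, spliced = classify_novel_peptides(anno, six, silico)
print(sorted(unspliced))  # ['PEPTIDEC']
print(sorted(spliced))    # ['PEPTIDED']
```

Sets are used here because only membership matters at this step; mapping the surviving peptides back to genomic coordinates would be a separate stage.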

Figure 3. Examination of N-terminal methionine excision (NME) identified in our study. (A) Number of N-terminal peptides after NME, shown as the fraction of observed N-terminally cleaved proteins with each amino acid at the N-terminus. (B) Relative frequency of NME for each amino acid at the +2 position. For each amino acid, the relative frequency was calculated by dividing the number of excised N-termini carrying that amino acid by the total number of N-termini, cleaved and un-cleaved, with that amino acid at the +2 position.
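The relative frequency in Figure 3B is a per-residue ratio of cleaved N-termini to all observed N-termini. A minimal Python sketch of that calculation follows, assuming the observations are available as (residue at +2, cleaved or not) pairs. The function name and the toy data are hypothetical.

```python
from collections import Counter


def nme_relative_frequency(observations):
    """Return {residue: cleaved_count / total_count} for the +2 position.

    observations: iterable of (residue_at_plus2, was_cleaved) pairs, one per
    observed protein N-terminus (a hypothetical input format).
    """
    cleaved = Counter()
    total = Counter()
    for residue, was_cleaved in observations:
        total[residue] += 1
        if was_cleaved:
            cleaved[residue] += 1
    # Every residue in `total` has a count of at least 1, so no division by zero.
    return {residue: cleaved[residue] / total[residue] for residue in total}


# Toy data: alanine at +2 seen three times (twice cleaved), lysine once (un-cleaved).
toy = [("A", True), ("A", True), ("A", False), ("K", False)]
freqs = nme_relative_frequency(toy)
```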


Figure 4. Examples of novel un-spliced peptides that suggest modifications of existing gene models. The tracks are as follows: SupercontXXX, supercontig ID; Read coverage, read coverage of the depicted region; TERG_XXX, transcript ID of the current annotation (version 2); Refined TERG_XXX, revised gene model for TERG_XXX; Novel gene model (in 4G), a possible novel transcript that is missing from the current annotation (version 2); NUSPXXX, peptide ID of identified novel un-spliced peptides that aligned to continuous genome sequences; Illumina, splicing junctions generated from mapped Illumina reads; ESTs, mapped EST reads; and 454, mapped 454 reads. The red bars of the SupercontXXX track stand for continuous genome sequences, and the dashed line (in 4E) stands for incomplete genome sequence in the internal portion of the supercontig. For all tracks other than SupercontXXX, the thick bars in each diagram stand for exons, and the thin lines stand for introns. The arrowheads (labelled on the transcripts and peptides, either on the thick bars or the thin lines) denote the direction of the strand (i.e., forward or reverse). Double circles at the terminus of an exon or intron indicate that the region extends beyond the border of the diagram. (A) Peptide NUSP0042 mapped to the intron of the TERG_00068 gene, suggesting translation of this region. (B) Peptide NUSP2597, located in the exon of the TERG_04104 gene, suggests a reading frame different from that in the annotation. The “Refined TERG_04104T0” track displays a missing intron closely upstream of peptide NUSP2597. (C) Peptide NUSP2952 mapped to the exon–intron junction of the TERG_04689 gene, suggesting correction of this exon–intron boundary. (D) Peptide NUSP2940 suggests N-terminal extension of the TERG_04676 gene. (E) The extended N-terminus of the TERG_04676 gene is also supported by NUSP2939, Illumina and ESTs. (F) Peptides NUSP1508 and NUSP1509 encoded in the 5’ UTR of the TERG_02376 gene, suggesting N-terminal extension of TERG_02376. (G) NUSP4089, NUSP4090 and NUSP4091, which are encoded downstream of the TERG_06499 gene, suggest a novel gene in that location, as diagrammed in “Novel gene model”. (H) Twenty-nine intergenic peptides encoded upstream of the TERG_00697 gene, suggesting extension of the TERG_00697 gene.

Figure 5. Examples of novel spliced peptides that suggest modifications of existing gene models. The tracks are as follows: SupercontXXX, supercontig ID; Read coverage, read coverage of the depicted region; TERG_XXX, transcript ID of the current annotation (version 2); Refined TERG_XXX, revised gene model for TERG_XXX; NSPXXX, peptide ID of identified novel spliced peptides that span intron(s); Illumina, splicing junctions generated from mapped Illumina reads; ESTs, mapped EST reads; and 454, mapped 454 reads. The red bars of the SupercontXXX track stand for continuous genome sequences. For all tracks other than SupercontXXX, the thick bars in each diagram stand for exons, and the thin lines stand for introns. The arrowheads (labelled on the transcripts and peptides, either on the thick bars or the thin lines) denote the direction of the strand (i.e., forward or reverse). Double circles at the terminus of an exon or intron indicate that the region extends beyond the border of the diagram. (A) Different donor site for the TERG_03492 gene supported by peptide NSP087 and the three transcriptomic tracks (Illumina, ESTs and 454). (B) Different acceptor site for the TERG_01562 gene supported by peptide NSP040 and the three transcriptomic tracks (Illumina, ESTs and 454). (C) Different donor and acceptor sites for the TERG_03204 gene supported by peptide NSP080 and the three transcriptomic tracks (Illumina, ESTs and 454). (D) Missing intron for the TERG_02286 gene supported by peptide NSP052 and the three transcriptomic tracks (Illumina, ESTs and 454). (E) Missing exon for the TERG_07205 gene supported by peptide NSP182 and the three transcriptomic tracks (Illumina, ESTs and 454). (F) Possible alternative splicing for the TERG_01163 gene: peptide NSP031 and two transcriptomic tracks (ESTs and 454) suggest a missing exon for TERG_01163, whereas the Illumina track shows a splicing form identical to the current annotation of TERG_01163.


TABLE

Table 1. Types of observed refinements of annotated genes. Intragenic peptides that might prompt corrections in the annotated gene models were visually inspected alongside the aligned transcriptomic data, using the UCSC Genome Browser. The inspected intragenic peptides included peptides mapped within and less than 100 bp from the annotated gene boundaries. The patterns of refinement were assigned based on evidence provided by transcriptomic data. A total of 127 revision patterns of existing genes were assigned based on 138 novel peptides.

Category of modification      Count
different splicing site       37
missing exon                  9
missing intron                28
extra intron                  31
shortened exon                21
incorrectly split gene        1
total                         127
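The inclusion rule in the Table 1 caption (peptides mapped within, or less than 100 bp from, the annotated gene boundaries) can be expressed as a simple coordinate check. The Python sketch below is one possible reading, assuming that the peptide and the gene lie on the same supercontig, that coordinates are inclusive with start <= end, and that the whole peptide span must fall inside the extended window. The function name and the coordinates are hypothetical.

```python
def is_inspected_intragenic(pep_start, pep_end, gene_start, gene_end, margin=100):
    """True if the peptide span lies within the gene, or less than `margin`
    bp outside its annotated boundaries (assumed interpretation)."""
    # Strict inequalities: a peptide edge exactly `margin` bp away is excluded,
    # matching "less than 100 bp".
    return pep_start > gene_start - margin and pep_end < gene_end + margin


# Toy coordinates for a gene annotated at 1000..2000.
print(is_inspected_intragenic(950, 980, 1000, 2000))  # True: 50 bp upstream
print(is_inspected_intragenic(850, 880, 1000, 2000))  # False: 150 bp upstream
```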

Table 2. Categories of novel un-spliced peptides. The novel un-spliced peptides were classified according to their locations relative to annotated gene models.

Peptide category                                                                                 Number of peptides   Example
Peptides mapped to the intronic regions of annotated genes                                       29                   Figure 4A
Peptides mapped within exons and translated in a different frame from the existing annotation    12                   Figure 4B
Peptides mapped to the exon-intron boundaries                                                    50                   Figure 4C
Peptides mapped to the gene boundaries                                                           15                   Figure 4D, 4E
Peptides mapped to 5’ UTRs                                                                       38                   Figure 4F
Peptides mapped to 3’ UTRs                                                                       39                   Figure 4G
Peptides mapped to intergenic regions                                                            102                  Figure 4H


Table 3. Categories of novel spliced peptides. The novel spliced peptides were classified according to how their splicing models differ from the annotated gene models.

Peptide category                                Number of peptides   Example
different donor sites                           7                    Figure 5A
different acceptor sites                        7                    Figure 5B
both different donor and acceptor sites         3                    Figure 5C
missing introns                                 18                   Figure 5D
missing exons                                   8                    Figure 5E

