Compact variant-rich customized sequence database and a fast and sensitive database search for efficient proteogenomic analyses.


DOI 10.1002/pmic.201400225

Proteomics 2014, 14, 2742–2749

RESEARCH ARTICLE

Heejin Park1, Junwoo Bae1, Hyunwoo Kim1, Sangok Kim2, Hokeun Kim3, Dong-Gi Mun3, Yoonsung Joh1, Wonyeop Lee1, Sehyun Chae4, Sanghyuk Lee2, Hark Kyun Kim5, Daehee Hwang4∗, Sang-Won Lee3 and Eunok Paek1∗

1 Department of Computer Science and Engineering, Hanyang University, Seoul, Republic of Korea
2 Department of Life Science and Ewha Research Center for Systems Biology, Ewha Womans University, Seoul, Republic of Korea
3 Department of Chemistry, Research Institute for Natural Sciences, Korea University, Seoul, Republic of Korea
4 Department of New Biology and Center for Plant Aging Research, Institute for Basic Science, DGIST, Daegu, Republic of Korea
5 National Cancer Center, Goyang, Republic of Korea

In proteogenomic analysis, construction of a compact, customized database from mRNA-seq data and a sensitive search of both reference and customized databases are essential to accurately determine protein abundances and structural variations at the protein level. However, these tasks have not been systematically explored, but rather performed in an ad hoc fashion. Here, we present an effective method for constructing a compact database containing comprehensive sequences of sample-specific variants (single nucleotide variants, insertions/deletions, and stop-codon mutations) derived from Exome-seq and RNA-seq data. Despite this coverage, the database occupies less space because it stores variant peptides, not variant proteins. We also present an efficient search method for both customized and reference databases. Separate searches of the two databases increase the search time, whereas a unified search is less sensitive in identifying variant peptides in the target-decoy setting, because the customized database is smaller than the reference database. Our method searches the unified database once, but performs target-decoy validations separately. Experimental results show that our approach is as fast as the unified search and as sensitive as the separate searches. Our customized database includes mutation information in the headers of variant peptides, thereby facilitating the inspection of peptide-spectrum matches.

Received: May 21, 2014 Revised: September 25, 2014 Accepted: October 10, 2014

Keywords: Bioinformatics / Early onset gastric cancer / Peptide identification / Proteogenomics / Sequence database

Additional supporting information may be found in the online version of this article at the publisher’s web-site

Correspondence: Dr. Sang-Won Lee, Department of Chemistry, Korea University, Seoul, 136-701, Republic of Korea
E-mail: sw [email protected]
Fax: 82-2-3290-3603

Abbreviations: CDS, coding DNA sequence; FDR, false discovery rate; FPKM, fragments per kilobase of exon per million fragments mapped; PSM, peptide-spectrum match; NGS, next-generation sequencing; SNV, single nucleotide variant

© 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

Colour Online: See the article online to view Figs. 1–4 in colour.

∗ Additional corresponding authors: Daehee Hwang and Eunok Paek,

E-mails: [email protected] and [email protected]

www.proteomics-journal.com


1 Introduction

Huge amounts of next-generation sequencing (NGS) data for various human diseases have been accumulated. From these sequencing data, many structural variations (single-nucleotide variants, insertions/deletions, stop-codon mutations, fusion genes, etc.) have been suggested as potential drivers in disease pathogenesis [1–3]. Confirmation of the structural variations at the protein level has been attempted to assess their functional roles in disease pathogenesis. Thus, in MS-based proteomic analysis, there has been a significant need for detecting the structural variations in proteins. In conventional MS-based approaches, peptide sequences are usually identified by matching MS/MS spectra to a reference protein sequence database such as the UniProt database. However, this conventional proteomic approach cannot detect the structural variations, because the variant sequences are absent from the reference database.

Proteogenomic studies have attempted to construct a database that can be used to confirm structural variations at the proteome level, by translating public variant genome/transcriptome sequences or NGS data. Excellent surveys can be found in [4, 5]. Notably, Kim et al. presented a draft map of the human proteome by using public variant sequence databases [6]. Woo et al. developed a method to construct a compact proteogenomic database from large-scale RNA-seq data [7]. Furthermore, researchers developed a customized, sample-specific database that contains variant sequences based on NGS data and searched both the customized and reference databases to identify and quantify variant proteins together with their original forms. Wang et al. developed a customized database containing the sequences for expressed original and variant proteins derived from RNA-seq data [8]. The variant protein sequences were identified by translating the genomic sequences of the variants detected in the sequencing data. They found that, in addition to variant proteins, a customized database detected more original proteins than the reference protein database.

However, constructing a customized database requires choosing a number of parameters for the translation of the structural variant sequences. Furthermore, it would be desirable to use both the customized and reference databases to sensitively identify original and variant proteins. One can search the two databases separately and then combine the search results. However, this increases the search time, which is wasteful given the huge overlap between the two databases. Alternatively, one can search the unified database of the two after removing the duplicate sequences. However, under target-decoy validation this can be less sensitive to variant proteins, because the customized database is smaller than the unified database. These aspects have not been systematically explored, but rather determined in an ad hoc fashion.

Here, we first present an effective method for constructing a compact database extensively including sequences of sample-specific variants (SNVs, Indels, and stop-codon mutations) derived from Exome-seq and RNA-seq of gastric cancer and adjacent normal tissues. Despite the comprehensive variant sequences, the customized database occupies less space by including variant peptides, not variant proteins. We also present an efficient method for searching a unified database of the customized and reference databases. In this method, the unified database was searched once, but target-decoy validations were performed separately for the two databases. The results demonstrate that our method is as fast as the unified search and as sensitive as the separate searches. Finally, our customized database provides mutation information in the headers of variant peptides, which effectively assists the inspection of peptide-spectrum matches.

2 Materials and methods

2.1 Exome- and mRNA-seq experiments

Gastric cancer, adjacent normal tissue, and blood samples were collected from a 38-year-old female patient who underwent distal gastrectomy at Asan Medical Center in 2013. This study was approved by the Institutional Review Board (IRB) of Asan Medical Center, and the patient signed an IRB-approved consent form. The tumor was a diffuse-type gastric adenocarcinoma in histology. Based on the histology, tumor and adjacent normal regions were validated, and tumor and normal masses were macro-dissected. Genomic DNA was isolated from the frozen tumor and buffy coat (normal control) using DNeasy blood and tissue kit (Qiagen, Valencia, CA). Total RNA was isolated from the tumor and adjacent normal samples using mirVana kit (Ambion, Carlsbad, MA). Exome-seq was performed using 1 μg of DNA from tumor and buffy coat samples for library preparation. SureSelect Human All Exon V4+UTRs kit (Agilent Technologies, Santa Clara, CA) was used for the exome capture. The captured DNA was sequenced using the Illumina HiSeq 2000 to generate paired-end sequence reads. For mRNA sequencing, the library was prepared using 1 μg of DNase I-treated total RNA using Illumina TruSeq kit (Illumina Cat. No. 1502062, San Diego, CA), and the paired-end sequencing was performed using the Illumina HiSeq 2000.

2.2 Global proteome profiling of gastric cancer and adjacent normal samples

Portions of the same gastric tissues (cancer and normal) that were used for exome and RNA-sequencing were processed to produce peptide samples for global proteome profiling as described in Supporting Information. A total of 400 μg (200 μg each from cancer and normal tissue) of the peptides were labeled with 4-plex iTRAQ reagent (AB Sciex, Foster City, CA). The two normal peptide samples (100 μg each) were double labeled with 114 and 116 iTRAQ reagents, respectively, while the two cancer peptide samples (100 μg each) were labeled with 115 and 117 iTRAQ reagents, respectively, according to the manufacturer's instructions. After the iTRAQ labeling, the four labeled peptide samples were pooled, and the pooled sample was immediately subjected to basic pH reverse-phase fractionation, in which the 96 fractions were pooled into 24 noncontiguously concatenated peptide fractions as previously described ([9], Supporting Information). Capillary RPLC-MS/MS experiments were performed on each of the 24 fractions as described in Supporting Information.

2.3 Analyses of Exome-seq and RNA-seq data

Exome sequence reads were mapped to the reference genome Human GRCh 37.71 (April 11th, 2013) using Bowtie2 (version 2.1.0, [10]) with the "end-to-end" alignment option. Before the mapping process, low-quality bases were clipped and removed with Sickle (version 1.2, https://github.com/ucdavis-bioinformatics/sickle). PCR duplicates were removed using Picard (version 1.92, http://picard.sourceforge.net). The Genome Analysis Toolkit (GATK, version 1.67, [11]) was used for local realignment and recalibration. We used TopHat (version 2.0.7, [12]) to map RNA-seq reads to the reference genome Human GRCh 37.71 with the default parameters and then used Cufflinks (version 2.1.1, [13]) to estimate transcript abundance in fragments per kilobase of exon per million fragments mapped (FPKM) for 192 882 transcripts. Novel transcripts were ignored by using the -G option. We selected expressed transcripts using FPKM > 1 as previously described [14]. Initial SNV candidates were identified separately for tumor and normal samples using GATK UnifiedGenotyper [15] with the default options. To remove false variant calls, we filtered out SNV candidates that were not supported by RNA-seq reads. Similarly, we first identified Indels separately for tumor and normal samples using Dindel (version 1.01, [16]) and GATK UnifiedGenotyper and then filtered out the ones that were not observed within 15 bp offsets in RNA-seq data. Finally, somatic mutations were independently predicted by MuTect (version 1.1.4, [17]) and Strelka (version 1.0.7, [18]) using the default options.
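The three filters described above (FPKM > 1 for expressed transcripts, RNA-seq support for SNVs, and the 15 bp window for Indels) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in this study: the `Variant` record and the function names are hypothetical, an SNV is treated as "supported" when the identical call also appears in the RNA-seq call set, and an Indel is treated as "observed" when any RNA-seq Indel lies on the same chromosome within 15 bp. The text does not specify these matching rules in more detail.

```python
from dataclasses import dataclass

FPKM_CUTOFF = 1.0       # expression cutoff stated in the text (FPKM > 1)
INDEL_MAX_OFFSET = 15   # bp window stated in the text for RNA-seq support


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record (not a real VCF parser)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def expressed_transcripts(fpkm_by_transcript, cutoff=FPKM_CUTOFF):
    """Return the IDs of transcripts whose FPKM is strictly above the cutoff."""
    return {t for t, fpkm in fpkm_by_transcript.items() if fpkm > cutoff}


def rna_supported_snvs(exome_snvs, rna_snvs):
    """Keep exome SNVs that also occur in the RNA-seq call set.

    Assumption: "supported" means the identical call (same chrom, pos,
    ref, alt) is present in the RNA-seq data.
    """
    rna = set(rna_snvs)
    return [v for v in exome_snvs if v in rna]


def rna_supported_indels(exome_indels, rna_indels, max_offset=INDEL_MAX_OFFSET):
    """Keep exome Indels with an RNA-seq Indel within max_offset bp.

    Assumption: only chromosome and position are compared, not the alleles.
    """
    kept = []
    for v in exome_indels:
        if any(r.chrom == v.chrom and abs(r.pos - v.pos) <= max_offset
               for r in rna_indels):
            kept.append(v)
    return kept
```

With real data, the FPKM dictionary would be read from the Cufflinks output and the variant lists from the callers' output files; the parsing is omitted here on purpose.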

2.4 Search of a unified database

Prior to the protein database search, the precursor masses of all LC-MS/MS datasets were corrected and refined by PE-MMR [19]. Briefly, the precursor-mass-corrected mgf files for the LC-MS/MS datasets were generated by the PE-MMR software (available at http://omics.pnl.gov/software/PEMMR.php) and were subjected to database searches using the MS-GF+ search engine (v.9881) [20] against each of three target-decoy protein databases: the UniProt DB (released May 2013; 90 191 entries), the customized DB, and the unified DB. The semitryptic option and a precursor mass tolerance of 10 ppm were used for the MS-GF+ search. Static modifications of iTRAQ (144.102063 Da) on N-termini and lysine and of carbamidomethylation (57.021460 Da) on cysteine, and a variable modification for oxidation (15.994915 Da) on methionine, were used.

After searching the unified DB, we performed the target-decoy validations [21] separately for the customized and UniProt DBs as shown in Supporting Information Fig. 1. First, two decoy distributions were constructed for the decoy sequences in the UniProt and customized DBs, respectively. To this end, we divided peptide-spectrum matches (PSMs) into three groups: PSMs whose matched peptides were included only in the customized DB (CDB-only PSMs), PSMs whose matched peptides were included only in the UniProt DB (UniProt-only PSMs), and PSMs whose matched peptides were included in both DBs (common PSMs). The decoy distribution for the UniProt DB was then constructed using the decoy PSMs of both UniProt-only and common PSMs, while the decoy distribution for the customized DB was obtained from the decoy PSMs of both CDB-only and common PSMs. Second, for CDB-only PSMs, false discovery rates (FDRs) were calculated using the decoy distribution for the customized DB. Similarly, FDRs for UniProt-only PSMs were computed using the UniProt decoy distribution. For each of the common PSMs, we computed two FDRs using the two decoy distributions and chose the smaller Q-value for the PSM. Finally, we selected the PSMs with FDR ≤ 0.01.
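The separate validation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions: the `PSM` record and function names are hypothetical, a higher score means a better match, each spectrum contributes one PSM, the FDR at a score threshold is estimated as the number of decoys divided by the number of targets at or above it, and the targets counted for each DB are the PSMs belonging to that DB (DB-only plus common).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PSM:
    """Hypothetical PSM record; one PSM per spectrum is assumed."""
    spectrum_id: str
    score: float        # assumption: higher score = better match
    is_decoy: bool
    in_cdb: bool        # matched peptide (or its decoy) is in the customized DB
    in_uniprot: bool    # matched peptide (or its decoy) is in the UniProt DB


def q_values(psms: List[PSM]) -> Dict[str, float]:
    """Estimate q-values from one target-decoy distribution."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    fdrs = []
    targets = decoys = 0
    for p in ranked:
        if p.is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # q-value: smallest FDR at this score threshold or any more permissive one
    q: Dict[str, float] = {}
    running_min = float("inf")
    for p, fdr in zip(reversed(ranked), reversed(fdrs)):
        running_min = min(running_min, fdr)
        q[p.spectrum_id] = running_min
    return q


def dual_null_validation(psms: List[PSM], fdr_cutoff: float = 0.01) -> List[PSM]:
    """Search once, validate twice: one decoy distribution per database."""
    cdb_q = q_values([p for p in psms if p.in_cdb])       # CDB-only + common
    uni_q = q_values([p for p in psms if p.in_uniprot])   # UniProt-only + common
    accepted = []
    for p in psms:
        if p.is_decoy:
            continue
        # Common PSMs have two q-values and keep the smaller one; DB-only
        # PSMs fall back to 1.0 for the database they do not belong to.
        q = min(cdb_q.get(p.spectrum_id, 1.0), uni_q.get(p.spectrum_id, 1.0))
        if q <= fdr_cutoff:
            accepted.append(p)
    return accepted
```

In this sketch a common PSM can pass through either distribution, which mirrors the "smaller Q-value" rule in the text; the final comparison corresponds to the FDR ≤ 0.01 selection.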

3 Results and discussion

3.1 Compact variant-rich customized sequence database

Our customized sequence database consists of expressed proteins, variant peptides, and contaminants. The overall scheme for construction of the customized database is summarized in Fig. 1. First, to identify expressed proteins, the abundance measures (FPKMs) for the 192 882 transcripts in Ensembl 71 were calculated by Cufflinks after aligning RNA-seq reads to the reference genome (Human GRCh 37.71) (Materials and methods; Fig. 1, left in magenta panel). Of the 192 882 transcripts, 34 098 and 29 235 with FPKM > 1 were selected as expressed transcripts in tumor and adjacent normal samples, respectively (Supporting Information Fig. 2). For each expressed transcript, we then obtained the corresponding expressed protein by retrieving its translated protein from UniProt using the Ensembl-UniProt mapping table. Of the expressed transcripts, 433 and 383 in tumor and adjacent normal samples, respectively, were not mapped to UniProt proteins. The expressed protein for each of such transcripts was obtained by translating the coding DNA sequence (CDS) of the transcript.

Figure 1. Overview of the workflow for construction of the customized DB and the DB search of MS/MS spectra. For the construction of the customized DB (left panel), we first identified expressed proteins from RNA-seq data as mapped coding transcripts with FPKM > 1 (see arrows from "RNA-seq" to "Expressed Proteins") and then variant peptides from exome-seq data as RNA-seq supported mutations in coding regions (see arrows from "Exome-seq" to "Variant Peptides"). For the DB search of MS/MS spectra (right panel), we first combined the customized and UniProt DBs (unified DB), searched the unified DB using MS-GF+, and applied target-decoy methods separately for the customized and UniProt DBs (see arrows from "MS/MS Spectra" to "Peptide ID"). Software tools used for individual analyses are shown. See text for detailed descriptions.

Second, variant peptides were generated from nonsynonymous SNVs, Indels, and stop-codon mutations identified from Exome- and RNA-seq data. Using the Exome-seq data from tumor and adjacent normal samples, we first identified 133 471 and 154 491 SNV candidates using GATK UnifiedGenotyper [11] (Fig. 1, right in magenta panel). We then filtered out the SNVs not observed in RNA-seq data, resulting in 36 518 and 34 034 SNVs (25.5 and 18.9% of the initial SNVs) for tumor and normal samples, respectively (Supporting Information Fig. 2). The exome-seq data were used because they had a more uniform depth profile than the RNA-seq data, thus making variant identification more robust. Next, we identified 197 and 184 Indels from tumor and normal samples, respectively, using Dindel [16] and GATK UnifiedGenotyper (Materials and methods; Supporting Information Fig. 2). Standard variant calls involving no comparison of tumor and normal samples tend to miss somatic SNVs. We therefore additionally identified 153 somatic SNVs by comparing the Exome-seq data of tumor and normal samples using MuTect [17] and Strelka [18]. Of them, 144 were not identified by GATK UnifiedGenotyper (Supporting Information Fig. 2).

For SNVs among the identified variants, we first extracted a CDS for each transcript with the SNVs, applied all the SNVs to the CDS, and then translated the resultant CDS into a variant protein (Fig. 2A). Variant peptide sequences were then retrieved from the variant protein as follows: each variant sequence was centered at the mutated amino acid and extended in both directions, allowing a certain number of miscleavages (we used three miscleavages in our experiment; see T→A SNV and Y→N amino acid change in Fig. 2A). The extension with three miscleavages was long enough to generate a sufficient number of possible peptides containing the mutated amino acid.

Next, variant peptides with Indels were generated. Unlike SNV calls, the number of Indel candidates was much smaller, and we applied each Indel independently even when several were called for a single transcript. Peptides were then translated from Indel positions using the same miscleavage parameter. Figure 2B shows how two different peptide sequences were translated from a single transcript with two Indels. As is the case in Fig. 2B, applying an Indel often results in a translated sequence with a frame change, and thus a completely new peptide sequence might be added to the customized DB.

SNVs and Indels can also occur in stop codons (stop-codon mutations), which may change them into nonstop codons (e.g. X→W change in Fig. 2C). In this case, a variant sequence was centered at the mutated stop codon and used the same three miscleavages for the left extension. However, the right translation stopped when (1) a new stop codon was found, (2) the read count at the translated positions was below a certain threshold, or (3) the number of extended amino acids was more than 20. All these variant peptide sequences were added to the customized database.

Figure 2. Variant peptides generation. (A) SNV: For each CDS (yellow) with SNVs, all the SNVs (e.g. C and T in light blue) in the CDS were applied, and the CDS was translated into a variant protein. Variant peptide sequences (green) were then retrieved from the variant protein. Each variant sequence is centered at the mutated amino acid (e.g. S or N in green box) and extended to both directions allowing a certain number of miscleavages (three miscleavages in this study). (B) Indel: each Indel was independently applied (CC insertion in left box and C deletion in right box) even when they were called for a single transcript (two Indels in this example) and then translated. The frame change occurred, resulting in new peptide sequences different from their original ones. (C) Stop-codon mutation: a variant sequence was centered at the mutated stop codon (X→W) and extended three miscleavages for the left direction, but up to the stopping point as described in text for the right direction.

Despite the sample-specific variant sequences, our customized database occupies less space than the reference protein database. The UniProt database includes 90 250 sequence entries with a total of 35 819 085 amino acids, while our customized database includes 32 402 sequence entries with a total of 10 534 333 amino acids (Table 1). Our customized database includes only expressed proteins and variant peptides, not variant proteins. The inclusion of variant peptides saves the database search space significantly because a mutation generates only one variant peptide in many cases, but can generate several redundant variant proteins due to protein isoforms. Compared with 13 240 544 amino acids in the variant protein database, the space requirement of the variant peptide database was only 1/16 (819 655 amino acids).

Table 1. Summary of DB search results

Database | Database size (# entries/AA) | Identified PSMs | Identified peptides | Protein groups a) | All proteins a) | # of mutated peptides (SNV/Indel) | Search time (h)
UniProt | 90 250/35 819 085 | 671 720 | 156 542 | 8388 | 39 481 | - | 76.39
Customized | 32 402/10 534 333 | 659 553 | 154 557 | 8496 | 14 313 | 847/8 | 60.51
UniProt+Customized | 90 250 + 32 402/46 353 418 | 684 839 | 161 649 | 8102 | 41 337 | 847/8 | 136.90
Unified I b) | 102 164/36 718 104 | 675 145 | 157 530 | 8395 | 43 024 | 825/8 | 78.89
Unified II c) | 102 164/36 718 104 | 686 311 | 162 318 | 8572 | 43 996 | 850/8 | 78.89

a) ≥ 2 peptide hits. b) Unified DB search with a single null distribution. c) Unified DB search with dual null distributions.

Finally, to sensitively identify original and variant proteins, a unified sequence database was generated by merging the UniProt and customized databases after removing duplicate sequences. Furthermore, our customized database provides mutation information in the headers of variant peptides, which effectively assists the inspection of peptide-spectrum matches. For instance, it can be easily seen that the amino acid T was mutated from A due to an SNV, by consulting the header displayed at the top of the spectrum shown in Fig. 3C. We also provide an output so that peptide-genome matches can be visualized in Genome Browser. For peptides shown in Fig. 3, their peptide-genome matches are displayed in Genome Browser with their mutations highlighted (Supporting Information Fig. 3). It should be noted that when a duplicate sequence is removed, its header information is transferred to the header of the remaining duplicate sequence.

Figure 3. Annotated MS/MS spectra for identified deletion (A), insertion (B), and SNV (C). Precursor isolation purity (PIP) is the ratio of total intensity of precursor isotopic envelope to total intensity in isolation window.

3.2 Fast and sensitive search of the unified database

Previous proteogenomic studies have used both the customized and reference DBs to sensitively identify the original and variant proteins. The existing search methods can be categorized into (1) the search of the customized DB only (CDB_only), (2) the separate searches of the customized and reference DBs followed by combining the search results from the two DBs (CDB+UniProtDB), and (3) the single search of the unified DB of the two DBs (UDB_single). The separate searches (CDB_only and CDB+UniProtDB) apply the target-decoy validation [21] for the two DBs separately, while the unified search (UDB_single) applies it only once for the unified DB. However, these methods can suffer from false negatives due to misused cutoffs (e.g. FPKM > 1 for expressed transcripts) or parameters for variant identification tools (CDB_only), long search time (CDB+UniProtDB), and reduced sensitivity to variant proteins due to the smaller size of the customized DB than that of the reference DB within the unified DB (UDB_single).

Thus, we developed an efficient and effective method for the use of the unified DB. Our method searches the unified DB once, but performs the target-decoy validations separately for the customized and reference DBs (see Materials and methods), thereby resulting in two lists of peptides identified with FDR < 0.01 separately from the two DBs (Fig. 1, yellow panel and Supporting Information Fig. 1). To assess the validity of our search method, we compared the search results from our method with those from the existing methods mentioned above (Table 1). The size of the unified DB (102 164 entries/36 718 104 AAs) used in our method and the UDB_single method was larger than either the customized DB (32 402 entries/10 534 333 AAs) or the UniProt DB (90 250 entries/35 819 085 AAs), but smaller than the sum of the two DBs (UniProt+Customized DB; 90 250 + 32 402 entries/46 353 418 AAs). The search time was proportional to the DB size. Our method had a similar search time (78.89 h) to that of the search of the UniProt DB (UniProtDB_only; 76.39 h). Despite its shorter search time compared to the CDB+UniProtDB method (136.90 h), our method identified slightly higher numbers of nonredundant (NR) peptides (162 318) and of proteins with ≥ 2 NR peptides (43 996) than the CDB+UniProtDB method (161 649 NR peptides and 41 337 proteins) (Table 1). It also identified more peptides and proteins than the CDB_only, UniProtDB_only, and UDB_single (Unified I in Table 1) methods. Furthermore, our method identified a number of variant peptides that is comparable to that of the CDB_only or CDB+UniProtDB methods and larger than that of the UDB_single method. The identified variant peptides include a deletion (Fig. 3A), an insertion (Fig. 3B), and an SNV (Fig. 3C).

Such high sensitivity to both the original and variant peptides can be ascribed to the use of different target-decoy FDR 1% cutoffs for the customized and UniProt DBs (Fig. 4A and B), resulting in two lists of the peptides from the two DBs separately. This procedure is intended to mimic the two target-decoy FDR cutoffs for the two DBs (Fig. 4D and E) in the CDB+UniProtDB method, but to use a decreased target-decoy FDR cutoff specifically for the customized DB to increase the sensitivity to variant peptides, compared to the FDR cutoff in the UDB_single method (Fig. 4C).

Figure 4. Distributions of target and decoy hits in the separate and unified searches. (A) UniProt PSMs from the unified DB search with two nulls. (B) CDB PSMs from the unified DB search with two nulls. (C) All PSMs from the unified DB search with one null. (D) All PSMs from UniProt DB search. (E) All PSMs from CDB search. (F) Venn diagram showing the peptide ID comparison among unified DB search with two nulls, CDB search, and UniProt DB search. (G) Venn diagram showing the peptide ID comparison among unified DB search with one null, CDB search, and UniProt DB search.

To understand the relationships among the identified peptides from all the methods, we compared the peptides identified by our method with those identified by the other methods (Fig. 4F). Our method identified almost all (>99%) of the peptides identified by the CDB+UniProtDB method. It failed to identify only 0.03% of the peptides (50 peptides) identified by UniProtDB_only and 0.22% of the ones (332 peptides) identified by CDB_only. In contrast, the UDB_single method failed to identify 4403 peptides identified by the CDB+UniProtDB method (Fig. 4G), which is larger than 1% (the FDR level) of the peptides identified by the UDB_single method, i.e. a significant false-negative error rate. Taken together, all these data demonstrate that our method is fast and sensitive to both the original and variant peptides, compared to the other methods.

This research was supported by the Proteogenomics Research Program (NRF-2012M3A9B9036675, NRF-2012M3A9B9036676) through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning. Support from the Institute for Basic Science (CA1308) and NRF programs (NRF-2012M3A9D1054452, NRF-2014M3C7A1046047) is also acknowledged.

The authors declare no competing financial interest.

4 References

[1] The Cancer Genome Atlas Research Network, Integrated genomic analyses of ovarian carcinoma. Nature 2011, 474, 609–615.
[2] The Cancer Genome Atlas Network, Comprehensive molecular portraits of human breast tumors. Nature 2012, 490, 61–70.
[3] The Cancer Genome Atlas Network, Comprehensive molecular characterization of human colon and rectal cancer. Nature 2012, 487, 330–337.
[4] Wang, X., Zhang, B., Integrating genomic, transcriptomic, and interactome data to improve peptide and protein identification in shotgun proteomics. J. Proteome Res. 2014, 13, 2715–2723.
[5] Hernandez, C., Waridel, P., Quadroni, M., Database construction and peptide identification strategies for proteogenomic studies on sequenced genomes. Curr. Top Med. Chem. 2014, 10, 425–434.
[6] Kim, M., Pinto, M. S., Getnet, D., Nirujogi, S. R. et al., A draft map of the human proteome. Nature 2014, 509, 575–581.
[7] Woo, S., Cha, S., Merrihew, G., He, Y. et al., Proteogenomic database construction driven from large scale RNA-seq data. J. Proteome Res. 2014, 13, 21–28.
[8] Wang, X., Slebos, R. J. C., Wang, D., Halvey, P. J. et al., Protein identification using customized protein sequence databases derived from RNA-Seq data. J. Proteome Res. 2012, 11, 1009–1017.
[9] Wang, Y., Yang, F., Gritsenko, M. A., Wang, Y. et al., Reversed-phase chromatography with multiple fraction concatenation strategy for proteome profiling of human MCF10A cells. Proteomics 2011, 11, 2019–2026.
[10] Langmead, B., Salzberg, S. L., Fast gapped-read alignment with Bowtie 2. Nat. Methods 2012, 9, 357–359.
[11] McKenna, A., Hanna, M., Banks, E., Sivachenko, A. et al., The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20, 1297–1303.
[12] Trapnell, C., Pachter, L., Salzberg, S. L., TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009, 25, 1105–1111.
[13] Trapnell, C., Hendrickson, D. G., Sauvageau, M., Goff, L. et al., Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat. Biotechnol. 2013, 31, 46–53.
[14] Graveley, B. R., Brooks, A. N., Carlson, J. W., Duff, M. O. et al., The developmental transcriptome of Drosophila melanogaster. Nature 2011, 471, 473–479.
[15] DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V. et al., A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011, 43, 491–498.
[16] Albers, C. A., Lunter, G., MacArthur, D. G., McVean, G. et al., Dindel: accurate indel calls from short-read data. Genome Res. 2011, 21, 961–973.
[17] Cibulskis, K., Lawrence, M. S., Carter, S. L., Sivachenko, A. et al., Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 2013, 31, 213–219.
[18] Saunders, C. T., Wong, W. S. W., Swamy, S., Becq, J. et al., Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs. Bioinformatics 2012, 28, 1811–1817.
[19] Shin, B., Jung, H. J., Hyung, S. W., Kim, H. et al., Postexperiment monoisotopic mass filtering and refinement (PE-MMR) of tandem mass spectrometric data increases accuracy of peptide identification in LC/MS/MS. Mol. Cell Proteomics 2008, 7, 1124–1134.
[20] Kim, S., Mischerikow, N., Bandeira, N., Navarro, J. D. et al., The generating function of CID, ETD, and CID/ETD pairs of tandem mass spectra: applications to database search. Mol. Cell Proteomics 2010, 9, 2840–2852.
[21] Elias, J. E., Gygi, S. P., Target-decoy search strategy for mass spectrometry-based proteomics. Methods Mol. Biol. 2010, 604, 55–71.

