Journal of Proteomics 118 (2015) 63–80


Extracting high confidence protein interactions from affinity purification data: At the crossroads☆

Shuye Pu a,⁎, James Vlasblom a,c, Andrei Turinsky a, Edyta Marcon d, Sadhna Phanse d, Sandra Smiley Trimble b, Jonathan Olsen b,d, Jack Greenblatt b,d, Andrew Emili b,d, Shoshana J. Wodak a,b,c,⁎⁎

a Hospital for Sick Children, 555 University Avenue, Toronto, Ontario M4K 1X8, Canada
b Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
c Department of Biochemistry, University of Toronto, Toronto, Ontario, Canada
d Banting and Best Department of Medical Research, University of Toronto, Donnelly Centre for Cellular and Biomolecular Research, 160 College Street, Toronto, Ontario M5S 3E1, Canada

ARTICLE INFO

ABSTRACT

Available online 14 March 2015

Keywords:
Protein–protein interaction
Scoring methods
Affinity purification
Mass spectrometry

Deriving protein–protein interactions from data generated by affinity-purification and mass spectrometry (AP–MS) techniques requires application of scoring methods to measure the reliability of detected putative interactions. Choosing the appropriate scoring method has become a major challenge. Here we apply six popular scoring methods to the same AP–MS dataset and compare their performance. The comparison was carried out for six distinct datasets from human, fly and yeast, which focus on different biological processes and differ in their coverage of the proteome. Results show that the performance of a given scoring method may vary substantially depending on the dataset. Disturbingly, we find that the high confidence (HC) PPI networks built by applying the six scoring methods to the same raw AP–MS dataset display very poor overlap, with only 1.7–4.1% of the HC interactions present in all the networks built, respectively, from the proteome-wide human, fly or yeast datasets. Various properties of the shared versus unique interactions in each network, including biases in protein abundance, suggest that current scoring methods are able to eliminate only the most obvious contaminants, but still fail to reliably single out specific interactions from the large body of spurious associations detected in the AP–MS experiments.

Biological significance
The fast progress in AP–MS techniques has prompted the development of a multitude of scoring methods, which are relied upon to remove contaminants and non-specific binders. Choosing the appropriate scoring scheme for a given AP–MS dataset has become a major challenge. The comparative analysis of 6 of the most popular scoring methods, presented here, reveals that overall these methods do not perform as expected. Evidence is provided that this is due to 3 closely related issues: the high ‘noise’ levels of the raw AP–MS data, the

☆ This article is part of a Special Issue entitled: Protein dynamics in health and disease. Guest Editors: Pierre Thibault and Anne-Claude Gingras.
⁎ Corresponding author.
⁎⁎ Correspondence to: S.J. Wodak, VIB Structural Biology Centre, VUB, Building E, Pleinlaan 2, 1050 Brussels, Belgium.
E-mail addresses: [email protected] (S. Pu), [email protected] (S.J. Wodak).

http://dx.doi.org/10.1016/j.jprot.2015.03.009 1874-3919/© 2015 Published by Elsevier B.V.



limited capacity of current scoring methods to deal with such high noise levels, and the biases introduced by using Gold Standard datasets to benchmark the scoring functions and threshold the networks. For the field to move forward, all three issues will have to be addressed.

1. Introduction

Affinity purification–mass spectrometry (AP–MS) has become one of the dominant experimental approaches for high-throughput analyses of protein–protein interactions (PPIs) and protein complexes [1–5]. With the improved detection sensitivity of MS instruments, the number of hit proteins (preys) that co-purify with the target proteins (baits) and can be detected has increased significantly. However, a sizeable fraction of these preys represent spurious binders that engage in non-specific interactions [6,7]. In order to filter out such spurious interactions, scoring methods are used to estimate the reliability of individual associations, a quantity often considered as related to their specificity. These estimates are then benchmarked against a reference set of reliable known interactions (the so-called ‘Gold Standard’) and used to derive the final high confidence (HC) network that contains only PPIs of an acceptable reliability level. Recently, several computational methods have been proposed for assigning reliability or confidence scores to associations detected in proteomic studies [8–13]. These scoring methods vary in many aspects (see [14] for an extensive review). Some methods consider only bait–prey interactions (spoke model), others take into account both bait–prey and prey–prey interactions (matrix model). These methods have been developed in studies that employ diverse experimental protocols and probe association landscapes that vary in coverage of the proteome, in the binding propensities of the corresponding proteins, and in the overall quality of the raw datasets. A new scoring method is usually developed on a single AP–MS dataset, and the same dataset is commonly used to compare its performance to those of extant methods. It is often unclear, therefore, whether a given method shown to outperform others on a specific dataset also performs well on other datasets.
Faced with a newly derived dataset, experimentalists are therefore often unable to make an informed choice about the scoring method that is best suited for processing their data. To address this gap, we compare the performance of six of the most popular scoring methods, with an emphasis on recently devised methods that incorporate spectral counts. We analyze 3 such methods: the Comparative Proteomic Analysis Software Suite (ComPASS) [12], the Significance Analysis of Interactome (SAINT) [8] and the Hypergeometric Spectral Counts score (HGSCore) [10]. Spectral counts are a semi-quantitative measure of protein abundance in samples [15] and their incorporation may therefore improve the reliability measure encoded in the score. On the other hand, spectral counts can be affected by inefficient protein digestion and peptide ionization [16]. Their incorporation might thus also have deleterious effects on the scoring method. To obtain a more general picture we also analyzed 3 popular scoring methods that do not utilize

spectral counts: the Purification Enrichment (PE) [9], Dice Coefficient (Dice) [13], and the Hart score (Hart) [11]. We did not evaluate several other published scoring methods, such as Mass Spectrometry Interaction Statistics (MiST) [17], Decontaminator [18], Socio-affinity score (SA) [19], Improved Socio-affinity score (ISA) [20] and Interaction Detection Based on Shuffling (IDBOS) [21]. This choice was due either to the lack of proper data (e.g., MiST requires protein intensity data and a large number of replicates; Decontaminator requires multiple control purifications to build a model of contaminants and uses Mascot scores as input) or to the similarity of the methods to one of those we chose to evaluate (e.g., SA is a simpler variant of the PE score; ISA and IDBOS are similar to the more widely used SAINT and ComPASS in their use of the spoke model). For the purpose of the present study, all 6 methods were applied to 6 different published raw AP–MS datasets, all of which had available spectral count data (Table 1). For the majority of the datasets (5 out of 6), a single scoring method was developed or applied by the authors to produce the final HC protein–protein interaction network. Here, all 6 methods were applied to each of the 6 datasets. Receiver Operating Characteristic (ROC) curve analysis was employed to benchmark the interactions scored by each method against literature-curated high confidence PPIs used as a reference (Gold Standard). These reference sets were retrieved from iRefWeb, a web resource for consolidated protein–protein interactions [22]. Since interacting protein pairs frequently share functions and/or cellular localizations, the similarity of the Gene Ontology (GO) [23] annotations of interacting pairs was used as an additional validation criterion. Our study describes the most thorough comparison to date of HC networks derived by applying different scoring methods to the same dataset.
This comparison also evaluates the extent to which the performance of different methods changes with the dataset to which they are applied. It demonstrates the importance of choosing a scoring method that is appropriate for the dataset at hand, and confirms the determinant role that features of the raw experimental AP–MS data themselves play in shaping the end result. By far the most important observation we make is that HC interactions derived by different scoring methods from the exact same raw dataset display very limited overlap. The poor overlap of HC interaction networks derived for the same organism by different experimental techniques has been previously documented [24–26]. Concerns have been raised that this may stem from the fact that the derived HC networks still incorporate a non-negligible fraction of spurious (non-specific) interactions, possibly because scoring methods may be less effective than expected [27,28], especially when they are applied to noisy datasets. Our



Table 1 – Details of the raw protein interaction data from the 6 affinity purification–mass spectrometry (AP–MS) studies analyzed in this study.

| Dataset             | Organism                 | Baits a | Preys | Purif. | Bait–prey associations | Interactions (published) | Tag           | MS spectra search method | Peptide identification score cut-off |
|---------------------|--------------------------|---------|-------|--------|------------------------|--------------------------|---------------|--------------------------|--------------------------------------|
| FLY [10] (HGSCore)  | Drosophila melanogaster  | 3313    | 4927  | 3488   | 438,557                | 10,969                   | HA            | SEQUEST                  | 0.8% (FDR)                           |
| YMP [29] (PE)       | Saccharomyces cerevisiae | 1385    | 4275  | 4362   | 215,339                | 13,343                   | TAP b         | SEQUEST                  | 90% (p-value)                        |
| CM [41] (iHGSCore)  | Homo sapiens             | 301     | 2566  | 547    | 81,808                 | 11,464                   | VAP c         | X!TANDEM/TPP             | 99% (p-value)                        |
| Kinome [31] (SAINT) | Saccharomyces cerevisiae | 260     | 2417  | 578    | 54,152                 | 1844                     | FLAG, HA, TAP | Mascot                   | 4% (FDR)                             |
| DUBS [12] (ComPASS) | Homo sapiens             | 102     | 2855  | 203    | 58,995                 | 1336                     | HA            | SEQUEST                  | 2% (FDR)                             |
| AIN [30] (ComPASS)  | Homo sapiens             | 65      | 2524  | 129    | 31,837                 | 751                      | HA            | SEQUEST                  | 2% (FDR)                             |

For each dataset, the method in parentheses is the one that was used to generate the published high confidence (HC) network. All studies used LC–MS/MS for peptide identification; YMP additionally used MALDI-TOF, for which the confidence cut-off is Z-score > 1.
a The number of baits is the count of baits that can be mapped to an Entrez gene id, which may be slightly less than the number reported in the original publications.
b TAP tag: CBP-TEV-ProteinA.
c VAP tag: FLAG-TEV-His-StrepII.

comparative analysis offers insights into the factors that may be at play.

2. Material and methods

2.1. AP–MS datasets

A total of 6 distinct datasets were obtained from published studies [10,12,29–31]. All 6 studies used affinity purification coupled with mass spectrometry (AP–MS) to identify protein complexes. The 6 datasets included information on spectral counts, which represent a rough measure of the relative abundance of the identified proteins [15,32], and collectively covered three organisms (human, fly and yeast). Spectral count data for the Drosophila melanogaster (FLY) dataset were downloaded from the published supporting materials [10]. Spectral count data for all other datasets were kindly provided by the respective authors. The 6 datasets are quite diverse in several aspects (see Table 1). The FLY dataset [10] was derived using a 1-step HA purification protocol. It is the largest in terms of both the number of baits (3313) and the number of associations (bait–prey and prey–prey), and is the only dataset that did not target any particular biological process, although the bait and prey proteins cover only 24% and 38% of the genome, respectively. This dataset was originally scored using the HGSCore algorithm [10]. The Yeast Membrane Protein (YMP) dataset [29] is similar in size, in terms of the number of associations, to the FLY dataset. Although targeted towards membrane proteins, the prey proteins in this dataset cover a large fraction (68%) of the yeast proteome. The YMP dataset was derived using Tandem Affinity Purification (TAP) and originally scored with the PE scoring method.

The third largest dataset of Table 1 is the recently published human Chromatin Modification (CM) dataset derived using the VAP tag system [33]. It targets proteins involved in chromatin modification processes that occur exclusively in the nucleus, and achieves quite a substantial coverage of nuclear proteins (36%). The three other datasets are significantly smaller. The Kinome dataset [31] focuses on kinases and phosphatases in yeast (Saccharomyces cerevisiae), and was originally scored using the SAINT algorithm. The DUBS [12] and AIN [30] datasets targeted deubiquitinases and autophagy-related proteins in human, respectively, and both were originally scored using ComPASS. The Kinome and DUBS datasets involve proteins that participate in diverse cellular processes.

2.2. Scoring methods and score calculations

The 6 analyzed scoring methods belong to two major categories: methods that only consider bait–prey associations (spoke model) and those that consider both bait–prey and prey–prey associations (matrix model). Several of the more recent methods incorporate information on spectral counts, taken as a rough measure of protein abundance [15], whereas others only consider the presence or absence of an association. The main features of the different scoring methods are briefly summarized below. Two methods, SAINT [8] and ComPASS [12], consider only bait–prey associations (spoke model) and both use information on spectral counts. We used version 2.3.1 of SAINT, which was the latest version available at the time of this study. This version implements a highly elaborate scoring method that models the spectral counts associated with each prey protein across all purifications as a mixture of two Poisson distributions — one for non-specific binding and another for specific binding. Given a spectral count value, the posterior probability that it represents a



specific association between a bait and a prey is computed using a Markov chain Monte Carlo algorithm. SAINT is also able to incorporate control experiments (such as purifications with tagged GFP) and has been adopted in several studies [31,34]. But its many adjustable parameters and high computational cost pose a challenge for large-scale studies (over 1000 purifications). Some of these shortcomings have been addressed in a recent, significantly overhauled version of this method [35], which was not tested here. ComPASS uses a purely heuristic approach. It incorporates, in a straightforward manner, information on spectral counts, on the promiscuity of prey proteins (the fraction of baits that a prey co-purifies with) and on the reproducibility of bait–prey associations in replicated purifications. Unlike SAINT, it does not require information on protein length, which may affect the mass spectrometry readout. ComPASS has been used in three studies by its developers [12,30,36]. The PE [9], Hart [11] and HGSCore [10] scoring methods all use the matrix model. PE, the earliest of the three methods, uses an approach similar to naive Bayes, and treats bait–prey and prey–prey associations differently. It incorporates neither spectral counts nor protein length, but has two parameters that need to be estimated empirically. One is a pseudo-count that smooths estimates of the frequency with which each prey is observed across all purifications, and the other is a rough estimate of the probability that genuine interactions will be preserved and detected in the experiments. The Hart score assumes that the number of times two proteins occur together across purifications follows a hypergeometric distribution, so that the specificity of an interaction can be gauged by the probability of the observed frequency of occurrence under this null model.
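As an illustration of the hypergeometric null model underlying the Hart score, the minimal Python sketch below (our own example, not the authors' implementation) computes the probability of observing at least k co-occurrences of two proteins by chance, given how often each appears across all purifications.

```python
from math import comb

def cooccurrence_pvalue(k, n_a, n_b, n_total):
    """Probability of seeing >= k purifications containing both proteins,
    given that protein A appears in n_a and protein B in n_b of the
    n_total purifications (hypergeometric null model)."""
    upper = min(n_a, n_b)
    tail = sum(comb(n_a, x) * comb(n_total - n_a, n_b - x)
               for x in range(k, upper + 1))
    return tail / comb(n_total, n_b)
```

A small p-value indicates that the pair co-occurs more often than expected by chance, i.e., a more specific association.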
In both methods, if a bait–prey association identified in purifications targeting that particular bait is also observed as a prey–prey association in purifications of other baits, the evidence for a specific association between those two proteins is strengthened. Whereas the PE score considers bait–prey and prey–prey associations separately – possibly ascribing them different confidence scores – the Hart and the HGSCore methods (described below) do not distinguish between these two types of PPI. In all of these methods, interactions between non-targeted proteins can therefore be discovered in addition to bait–prey interactions, if the proteins appear together more frequently than expected by chance. The HGSCore follows the Hart statistical formulation but, for each observed putative association between proteins, it uses an integer count based on the minimum normalized spectral count of the two proteins. The HGSCore furthermore incorporates information on protein length and is easy to compute. Here we use a modified version of this scoring scheme, denoted iHGSCore, which down-weights associations that do not involve the purified bait protein (see Supplementary materials for details). This modification increases the number of baits in the final network with no adverse impact on the false discovery rate. The Dice Coefficient [13], the fourth matrix model-based scoring method, uses a completely different approach from the other 5 methods. It associates each protein (both bait and prey) with a binary vector that records its presence (1) or absence (0) in each purification. Individual interactions are

then scored by measuring the similarity between the binary vectors of the two component proteins. Dice incorporates no information on spectral counts or protein length and treats all observed associations the same way, regardless of whether they involve the purified bait protein or not. With the exception of SAINT, all analyzed scoring methods were implemented in house. The Dice Coefficient, Hart score and iHGSCore were implemented in Java, and the PE and ComPASS methods were implemented in Perl. In our implementation of the ComPASS score, the randomization step was omitted to gain speed. This step is mainly used to determine the score cut-off; it does not affect the score ranking and hence has no impact on the comparative analysis performed here. SAINT (version 2.3.1) was downloaded from http://sourceforge.net/projects/saint-apms/files/. Interaction scores using the above-mentioned methods were computed for all 6 datasets listed in Table 1. For the Kinome dataset, scores were computed for the FLAG-, HA (hemagglutinin)- and TAP-tagged samples individually, and then combined by taking the maximum score across the 3 samples as the final score for each bait–prey pair. The parameters for each method applied to each dataset are given in supplementary Table S1. Optimal parameter values were determined by a grid search when 2 or more parameters had to be specified.
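The Dice Coefficient itself is simple enough to sketch directly. The Python fragment below (an illustration, not the in-house Java code) scores a protein pair from their binary presence/absence profiles:

```python
def dice_score(profile_a, profile_b):
    """Dice coefficient 2*|A and B| / (|A| + |B|) between two binary
    presence/absence vectors, one entry per purification."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    total = sum(profile_a) + sum(profile_b)
    return 2.0 * both / total if total else 0.0
```

Two proteins detected in exactly the same purifications score 1.0; proteins never seen together score 0.0.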

2.3. Gold standard reference sets

The derived networks were benchmarked against Gold Standard (GS) reference sets. For a given organism this reference set was built from literature-curated PPI data archived in the iRefWeb database (Version 4.1) [22]. The positive reference set comprised high confidence PPIs that satisfy all of the following criteria: reported in at least three publications, conserved in at least one other organism, and having an MI score > 0.43. The MI score is a measure of confidence in a molecular interaction curated from the literature [37], which integrates supporting evidence such as the number of publications reporting the interaction, how the interaction is detected (the type of experiments – direct or indirect associations; the scale of the experiments – large or small) and the extent to which the interaction is conserved (the number of organisms in which interologues exist) (for a detailed description of the MI score, see http://wodaklab.org/iRefWeb/faq#mint). Because the number of curated interactions for the fly, D. melanogaster, is small, less stringent criteria were used to define the positive reference set for the FLY dataset: interactions had to be reported in at least two publications, or reported in one publication, conserved in at least one other organism and have an MI score > 0.43. The positive reference sets for human, yeast and fly comprised 19,342, 16,641 and 7707 PPIs, respectively. Each negative reference set encompassed the same set of proteins as the corresponding positive reference set. It consisted of randomly selected protein pairs that are not in the iRefWeb database, and its size was 10 times larger than that of the positive set [29,38]. Since ComPASS and SAINT score only bait–prey interactions, the networks built using these scores were benchmarked against reference sets comprising only interactions between proteins engaged in detected bait–prey interactions (spoke GS).
Given that networks built using the matrix model also encompass bait–prey interactions, spoke GS sets were likewise used to benchmark networks built with both spoke-based and matrix-based scoring methods. In addition, networks built with matrix-based scoring methods were benchmarked against reference sets comprising both bait–prey and prey–prey interactions (matrix GS), since only these networks consider both types of interactions.
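The selection criteria for the positive reference sets can be summarized in code. The predicate names and argument layout below are our own illustration of the stated rules, not the iRefWeb schema:

```python
def in_positive_set(n_publications, n_conserved_organisms, mi_score):
    """Human/yeast criteria: >= 3 publications, conserved in >= 1 other
    organism, and MI score > 0.43 (all three must hold)."""
    return (n_publications >= 3 and n_conserved_organisms >= 1
            and mi_score > 0.43)

def in_positive_set_fly(n_publications, n_conserved_organisms, mi_score):
    """Relaxed fly criteria: >= 2 publications, OR one publication
    combined with conservation and MI score > 0.43."""
    return (n_publications >= 2 or
            (n_publications >= 1 and n_conserved_organisms >= 1
             and mi_score > 0.43))
```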

2.4. Scoring scheme performance measures

The performance of a given scoring scheme was evaluated by benchmarking the quality of the generated network at various score thresholds against that of the corresponding reference set. To that end, the following quantities were computed for a network defined using a given score threshold:

True positives (TP): the number of identified interactions that are part of the positive reference set.
False positives (FP): the number of identified interactions that map into the negative reference set.
False Discovery Rate (FDR): FP/(TP + FP).
Bait recovery rate: the fraction of all baits that is recovered in the considered network.
Network size: the number of PPIs in the network for a given score threshold.

Since scores are ranked in descending order (the highest score ranks 1), an often used alternative measure is the rank of the score cut-off used to threshold the network [20]. Values for the FDR and bait recovery rate range from 0 to 1. TP and FP are used to plot ROC-like curves. The figures based on these measures were generated using the R programming environment (http://www.r-project.org/). Traditionally, the Receiver Operating Characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) to assess the signal-to-noise trade-off. We prefer ROC-like curves that plot the number of true positives against the number of false positives at various score cut-offs. First, they clearly visualize the behavior of the different curves at high score values, dispensing with the insets that classical ROC plots commonly require to expand the score scale in these regions. Second, they make it possible to draw the FDR line across the different curves, illustrating the effect of applying the same FDR threshold to PPI networks generated by different methods in order to derive HC portions of these networks that can be compared between methods.
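These quantities reduce to simple set look-ups. A minimal sketch of the benchmarking step (our own example; unordered protein pairs are represented as frozensets so that A–B equals B–A):

```python
def benchmark(network_pairs, positive_ref, negative_ref):
    """Return (TP, FP, FDR) for a thresholded network benchmarked
    against positive and negative reference sets of protein pairs."""
    tp = sum(1 for pair in network_pairs if pair in positive_ref)
    fp = sum(1 for pair in network_pairs if pair in negative_ref)
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return tp, fp, fdr
```

Interactions absent from both reference sets contribute to the network size but not to TP, FP or the FDR.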


2.6. Overlap between networks derived with different scoring methods

To enable a fair comparison between the networks derived by applying different scoring methods to the same raw dataset, the derived networks were thresholded at score values corresponding to the same FDR (False Discovery Rate). FDR values of 15%, 6%, and 5% were used for the CM, FLY and YMP datasets, respectively, so as to approximately match the number of high confidence interactions in the published networks [10,29,41].
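Thresholding a scored network at a target FDR can be sketched as follows (our illustration of the general procedure, not the authors' code): walk the pairs in descending score order and keep the largest prefix whose FDR stays at or below the target.

```python
def threshold_at_fdr(scored_pairs, positive_ref, negative_ref, target_fdr):
    """scored_pairs: iterable of (pair, score). Returns the largest
    score-ranked prefix whose FDR = FP/(TP+FP) is <= target_fdr."""
    tp = fp = 0
    kept, best = [], []
    for pair, _score in sorted(scored_pairs, key=lambda ps: -ps[1]):
        kept.append(pair)
        if pair in positive_ref:
            tp += 1
        elif pair in negative_ref:
            fp += 1
        # record this prefix if its FDR still meets the target
        if tp + fp > 0 and fp / (tp + fp) <= target_fdr:
            best = list(kept)
    return best
```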

2.7. Analysis of protein abundance

We compare the distribution of protein abundance values in the HC PPI networks built and thresholded using the different scoring methods, to the native distribution in the cell. This comparison is then used to assess the extent to which the different scoring schemes bias the resulting network towards abundant proteins, which are more likely to be involved in non-specific interactions. Data on cellular protein abundance levels for yeast, fly and human were downloaded from the PaxDb database [32]. This database integrates absolute protein abundance levels measured by various quantitative proteomics experiments in whole organisms and in specific tissues, for 12 model organisms. To get the best coverage of the proteomes, the PaxDb integrated datasets were used for fly and yeast. For human, the GPM dataset (http://www.thegpm.org/) was used because the integrated dataset is heavily skewed towards low abundance proteins (see supplementary Fig. S19 for the abundance profiles used). These data do not accurately reflect the protein abundance levels in the cell lines used in the proteomic studies examined in this work, and therefore represent only rough estimates of these levels. In each organism, a protein is considered abundant if its parts per million (ppm) value is above the 80th percentile in the entire proteome (14.2 ppm, 72.0 ppm, and 79.0 ppm for human, fly and yeast, respectively). We excluded from the analysis proteins with extremely low levels of expression (
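The 80th-percentile abundance cut-off can be computed with a simple nearest-rank rule. The sketch below is our own illustration; the percentile values quoted above came from the PaxDb/GPM abundance profiles, not from this code.

```python
def abundant_proteins(ppm_by_protein, percentile=80):
    """Return the proteins whose ppm abundance lies above the given
    percentile of the whole distribution (nearest-rank cut-off)."""
    values = sorted(ppm_by_protein.values())
    cutoff = values[max(0, int(len(values) * percentile / 100) - 1)]
    return {p for p, v in ppm_by_protein.items() if v > cutoff}
```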
