Methods 83 (2015) 51–62


Essential protein identification based on essential protein–protein interaction prediction by Integrated Edge Weights

Yuexu Jiang a,b, Yan Wang a,c,⇑, Wei Pang d, Liang Chen a, Huiyan Sun a, Yanchun Liang a,e,⇑, Enrico Blanzieri c,⇑

a Key Laboratory of Symbolic Computation and Knowledge Engineering, College of Computer Science and Technology, Jilin University, Changchun 130012, China
b Department of Computer Science, University of Missouri, Columbia, MO, United States
c Department of Information Engineering and Computer Science, University of Trento, Povo, Italy
d School of Natural and Computing Sciences, University of Aberdeen, Aberdeen, UK
e Department of Computer Science and Technology, Zhuhai College of Jilin University, Zhuhai 519041, China

Article info

Article history: Received 28 January 2015; Received in revised form 9 April 2015; Accepted 10 April 2015; Available online 16 April 2015

Keywords: Essential protein; Essential protein–protein interaction; Integrated Edge Weights

Abstract

Essential proteins play a crucial role in cellular survival and development. Experimentally, essential proteins are identified by gene knockouts or RNA interference, which are expensive and often fatal to the target organisms. Computational prediction is therefore an important alternative approach to essential protein identification. Existing computational methods predict essential proteins based on their relative densities in a protein–protein interaction (PPI) network. Degree, betweenness, and other appropriate criteria are often used to measure the relative density. However, no matter which criterion is used, a protein is ranked by its own attributes. In this research, we present a novel computational method, Integrated Edge Weights (IEW), which first ranks protein–protein interactions by integrating their edge weights, then identifies sub-PPI networks consisting of the highly ranked edges, and finally regards the nodes in these sub-networks as essential proteins. We evaluated IEW on three model organisms: Saccharomyces cerevisiae (S. cerevisiae), Escherichia coli (E. coli), and Caenorhabditis elegans (C. elegans). The experimental results show that IEW achieves better performance than the state-of-the-art methods in terms of precision–recall and Jackknife measures. We also demonstrate that IEW is a robust and effective method, which can retrieve biologically significant modules from its highly ranked protein–protein interactions for S. cerevisiae, E. coli, and C. elegans. We believe that, given sufficient data, IEW can be applied to essential protein identification in any other organism. A website about IEW can be accessed at http://digbio.missouri.edu/IEW/index.html.

© 2015 Elsevier Inc. All rights reserved.

⇑ Corresponding authors at: Key Laboratory of Symbolic Computation and Knowledge Engineering, College of Computer Science and Technology, Jilin University, Changchun 130012, China. E-mail addresses: [email protected] (Y. Wang), [email protected] (Y. Liang), [email protected] (E. Blanzieri).
http://dx.doi.org/10.1016/j.ymeth.2015.04.013
1046-2023/© 2015 Elsevier Inc. All rights reserved.

1. Introduction

Essential proteins are indispensable for the survival of an organism under certain conditions [1]. Reliable identification of essential proteins is of great significance since it contributes to a better understanding of the key biological processes of an organism at the molecular level, which is useful for guiding drug design, disease diagnosis, and medical treatments. Experimentally, many researchers determine the essentiality of proteins by knocking out particular proteins and checking the viability of the affected organisms [1,2]. However, the cost of such biological wet-lab experiments is normally very high and, more importantly, they are ethically impossible on humans. This makes in silico analysis a necessary method of choice for this research. Currently, there is still much work to be done by computational biologists for the effective identification of essential proteins.

Nowadays, owing to high-throughput techniques, large-scale protein–protein interaction (PPI) data are available for many organisms, especially for model organisms such as Saccharomyces cerevisiae and Escherichia coli. Based on these data, several studies have investigated the relationships between experimentally identified essential proteins and PPI networks. Jeong et al. [3] noted that the essentiality of a protein is highly correlated with its centrality in a PPI network, and this observation is formulated as the centrality-lethality rule [3,4]. Guided by this rule, many measures have been proposed for essential protein detection, such as degree centrality [3], betweenness centrality [4], closeness centrality [5], subgraph centrality [6], eigenvector centrality [7], and network bottleneck [8]. Basically, these methods rank proteins by their centrality measures in a PPI network to identify their essentiality.

In addition to these purely node-centrality based algorithms, a few edge-aided methods have also been developed. For instance, Wang et al. [9] employed the concept of edge clustering coefficient (the NC method) to identify essential proteins in a PPI network. Further improvements of the NC method were achieved by taking gene expression information into consideration (PeC) [10]. Although edge information plays an important role in the prediction processes of these edge-aided methods, the fundamental idea behind them is still to rank proteins according to their centrality measure in the PPI network.

In 2005, Pereira-Leal et al. [11] pointed out that essential proteins tended to be connected more frequently to other essential proteins than to non-essential proteins in S. cerevisiae PPI networks. They found that after removing all the non-essential proteins from a PPI network, approximately 97% of the essential proteins were still connected, which suggested a close interaction relationship among essential proteins. He et al. [12] tried to explain why highly connected nodes tend to be essential and proposed the concept of essential protein–protein interactions. They argued that the essentiality of proteins comes from the essentiality of protein–protein interactions rather than from the proteins per se, which substantially changes the perspective of the problem. Following this direction, some researchers have scored the relatedness of proteins connected by edges in a PPI network [9,10,13].
Some of these measures are based on the topology of a PPI network, such as the number of triangles an edge belongs to [9], while other measures are obtained by integrating other biological information, such as Gene Ontology similarity and gene co-expression degree (the EW method) [10,13].

In this paper, we present a novel essential-protein prediction strategy. Unlike other state-of-the-art methods, which rank proteins directly, our method (IEW) predicts essential proteins based on Integrated Edge Weights. By integrating several widely used types of PPI topological information and biological data, IEW can overcome the possible failure of one or more attributes. We carried out a comprehensive evaluation on the S. cerevisiae, E. coli, and Caenorhabditis elegans datasets, and showed that IEW was more accurate and robust than its competitors. Furthermore, the predicted high-ranked edges tend to be highly biologically significant in the S. cerevisiae, E. coli, and C. elegans PPI networks.

2. Methods

After collecting data from various sources, the workflow of our method is divided into four parts, as shown in Fig. 1. First, we assess a particular interaction by using five different measurements. Second, we integrate these five measurements into a final weight so that we can rank all the links of a PPI network and obtain a list of the essential interactions. Third, we predict essential proteins based on the obtained essential interaction list. Finally, we use three evaluation methods to test our prediction strategy.

2.1. Protein–protein relationship evaluation

The IEW model aims to evaluate the relationship between two proteins from various perspectives. To achieve this, we integrated five measures into our final model, covering topology information, gene expression information, physical interaction, gene annotation, and degree of conservation.
Among them, topology information, gene expression information, and gene annotation information have been widely used, while to the best of our knowledge physical interaction information and degree of conservation are used for this specific purpose for the first time.

Fig. 1. The overall workflow of the proposed method. After data collection, our method can be divided into four steps: (A) protein–protein relationship evaluation; (B) essential PPI prediction by the Integrated Edge Weight; (C) essential protein prediction; (D) prediction result evaluation.

2.1.1. Measure 1: number of triangles

Topological characteristics of PPI networks encode important information related to the lethality of the absence of a protein. According to the centrality-lethality rule, we considered that essential protein–protein links should tend to be more cliquish. Estrada [14] reported that the proteins selected by any of the spectral measures of centrality tended to form clusters of highly interconnected nodes, and these clusters contained a large number of triangles as measured by the clustering coefficient. Therefore, in this research we used the number of triangles (NTE) as one of the measures to determine the significance and centrality of an edge. In an undirected graph G = (N, E), where N is the set of proteins (nodes) in the network and E is the set of interactions (edges), the NTE of an edge (u, v) is defined as:

$$\mathrm{NTE}(u, v) = |C_u \cap C_v| + 1, \tag{1}$$

where $C_u$ (or $C_v$) denotes the set of neighbours of node u (or v) in a PPI network; $|C_u \cap C_v|$ is the number of neighbours shared by nodes u and v, which coincides with the number of triangles that the edge (u, v) belongs to. We add 1 at the end of the equation so that the result is always greater than zero. This prevents the NTEs of all protein pairs from being equal to zero, which would cause problems in the normalisation process.

2.1.2. Measure 2: gene expression similarity

Gene expression data are perhaps the most easily obtained and widely used biological data. Studying co-expression patterns [15] can provide useful insights into the underlying cellular processes. Because co-expressed genes have a high probability of encoding interacting proteins [16], we chose gene expression similarity as one of the five measurements. In our method, we used the Pearson Correlation Coefficient to measure gene expression similarity. The gene expression similarity (GES) of proteins u and v is calculated as follows:

$$\mathrm{GES}(u, v) = \frac{1}{s-1} \sum_{i=1}^{s} \left[ \frac{U_i - \bar{U}}{\sigma(U)} \right] \left[ \frac{V_i - \bar{V}}{\sigma(V)} \right], \tag{2}$$

where genes U and V encode the corresponding pair of proteins u and v, respectively; s is the number of samples in the gene expression data; $U_i$ and $V_i$ are the expression levels of genes U and V in sample i, respectively; $\bar{U}$ and $\bar{V}$ are the means of the expression levels of genes U and V; and $\sigma(U)$ and $\sigma(V)$ are the standard deviations of the expression levels of genes U and V.

2.1.3. Measure 3: GO semantic similarity

GO (Gene Ontology) [17] is designed to represent the known relationships between biological terms and the genes that are instances of those terms. GO semantic similarity uses the biological characteristics of genes to reveal their functional similarity. It is commonly assumed that two essential proteins linked by an edge are more likely to participate in the same biological processes [18]. Therefore, we only considered the biological process category of GO. In this research we used the Lin algorithm [19], which is widely used and is integrated into the GO tool FastSemSim [20]. The GO similarity (GOS) between proteins u and v is defined as:

$$\mathrm{GOS}(u, v) = \mathrm{LIN}(U, V) = \max \frac{2 \times \ln P_{ms}(c_1, c_2)}{\ln P(c_1) + \ln P(c_2)}, \tag{3}$$

$$P_{ms}(c_1, c_2) = \min_{c \in S(c_1, c_2)} \{ P(c) \}, \tag{4}$$

where genes U and V encode the pair of proteins u and v; $c_1$ and $c_2$ are the terms of genes U and V, respectively. These terms are concepts in the GO structure. P(c) is the probability of encountering an instance of term c; $S(c_1, c_2)$ is the set of terms that subsume both $c_1$ and $c_2$. We choose the maximum GO similarity over the term pairs as the GO similarity of u and v.

2.1.4. Measure 4: domain interaction strength

Within a protein, a structural domain (simply called "domain") is an element of the overall structure that is self-stabilising and folds independently of the rest of the protein chain [21]. Domain information is sometimes used to predict the existence of a protein–protein interaction [22]. However, to the best of our knowledge, we are the first to use domain information for measuring the essentiality of a protein–protein interaction. Intuitively, an essential interaction should have a stronger connection than a non-essential one. Only in this way can an organism live well, since it is harder to destroy an essential interaction. We calculated a domain score (DS) for every edge in the PPI network as its connection strength. We used InterProScan5 [23] to find the domains that each protein possesses, and then allocated a weight to each domain-domain interaction. The domain score of proteins u and v is defined as:

$$\mathrm{DS}(u, v) = \sum_{i, j} W_{ij} \times D_i \times D_j, \tag{5}$$

where i and j are the domains of u and v, respectively; $D_i$ is the number of copies of domain i in protein u; $D_j$ is the number of copies of domain j in protein v; and $W_{ij}$ is the interaction weight of domain i and domain j. A detailed definition of $W_{ij}$ is given in the material section (Section 3.2.4).

2.1.5. Measure 5: phylogenetic profile similarity

Phylogenetic profile similarity [24] is used to measure the conservation degree of genes in a set of organisms. Given an organism list, if the orthologous genes of a gene pair co-occur across several genomes, we can say that these two genes have a high phylogenetic profile similarity [24]. We believe that if two genes are present or absent together in many genomes, the proteins they encode are more likely to have an essential interaction, since neither of them can perform its function alone. In this research, we calculated the phylogenetic profile similarity (PPS) between proteins u and v by using an improved Pearson Correlation method [25], as shown below:

$$\mathrm{PPS}(u, v) = \frac{N z - n_u n_v}{\sqrt{(N n_u - n_u^2)(N n_v - n_v^2)}}, \tag{6}$$

where N is the total number of genomes used; $n_u$ and $n_v$ are the numbers of genomes, among the N genomes, in which proteins u and v have orthologous genes, respectively; and z is the number of genomes in which proteins u and v occur together.

2.2. Essential PPI prediction by the Integrated Edge Weight

We used the five measurements described above to evaluate the relationship between two proteins in a PPI network. Gene expression similarity (GES) and GO semantic similarity (GOS) range from 0 to 1, and we normalised number of triangles (NTE), domain interaction strength (DS) and phylogenetic profile similarity (PPS) to the same range [0, 1]. We can treat each value as the strength of the evidence given by the corresponding measurement in support of the essentiality of the interaction. Thus, a 5-dimensional vector whose elements all equal 1 means that all five variables maximally suggest that the edge is an essential PPI. We then used a straightforward geometric model to predict PPI essentiality. The details of the model are illustrated in Fig. 2. Formally, the essentiality of a PPI measured by integrating edge weights is defined as follows:

$$\mathrm{IEW}(u, v) = \cos\theta \times |\overrightarrow{DATA}| = \frac{\overrightarrow{DATA} \cdot \overrightarrow{ONE}}{|\overrightarrow{DATA}| \times |\overrightarrow{ONE}|} \times |\overrightarrow{DATA}| = \frac{\mathrm{NTE}(u, v) + \mathrm{GES}(u, v) + \mathrm{GOS}(u, v) + \mathrm{DS}(u, v) + \mathrm{PPS}(u, v)}{\sqrt{5}}, \tag{7}$$

Fig. 2. The geometric model. We illustrate our geometric model using a 2-dimensional graph, although our model operates in a 5-dimensional space. As shown in the graph, $\overrightarrow{OA}$ is the all-ones vector (1, 1), $\overrightarrow{OB}$ is a specific data vector, and its essentiality is measured by the modulus $|\overrightarrow{OC}|$ of its projection on the vector $\overrightarrow{OA}$. We can see that $|\overrightarrow{OC}|$ depends not only on the angle θ between $\overrightarrow{OA}$ and $\overrightarrow{OB}$, but also on the norm of $\overrightarrow{OB}$. In other words, the larger the values of $\overrightarrow{OB}$, or the closer the vectors $\overrightarrow{OA}$ and $\overrightarrow{OB}$, the more evidence there is for the essentiality of the PPI represented by $\overrightarrow{OB}$.
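As a quick check of the geometry in Fig. 2, the following worked example uses an arbitrarily chosen, hypothetical data vector $\overrightarrow{OB} = (0.8, 0.4)$ (not taken from the paper's data). It shows that the projection length equals the sum of the components divided by $\sqrt{2}$, the 2-dimensional analogue of dividing by $\sqrt{5}$ in the 5-dimensional model:

```latex
\begin{align*}
|\overrightarrow{OB}| &= \sqrt{0.8^2 + 0.4^2} = \sqrt{0.8} \approx 0.894, \\
\cos\theta &= \frac{\overrightarrow{OA} \cdot \overrightarrow{OB}}
                   {|\overrightarrow{OA}|\,|\overrightarrow{OB}|}
            = \frac{0.8 + 0.4}{\sqrt{2}\,\sqrt{0.8}} \approx 0.949, \\
|\overrightarrow{OC}| &= |\overrightarrow{OB}| \cos\theta
            = \frac{0.8 + 0.4}{\sqrt{2}} \approx 0.849 .
\end{align*}
```

A less balanced vector with the same component sum, for example (1.0, 0.2), yields the same projection length, so under this model the score depends only on the sum of the normalised measures.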

where $\overrightarrow{DATA}$ is the 5-dimensional data vector composed of NTE, GES, GOS, DS and PPS; $\overrightarrow{ONE}$ is the 5-dimensional vector whose components are all equal to 1; and θ is the angle between $\overrightarrow{ONE}$ and $\overrightarrow{DATA}$.

Fig. 3. Boxplots showing the ability of the different attributes to distinguish essential-essential protein interactions from the other two interaction types (essential-nonessential and nonessential-nonessential interactions). (A) is for gene expression similarity, and the mean values for the interaction types from left to right are 0.4161, 0.3304, and 0.3614, respectively. (B) is for GO semantic similarity in the biological process category, and the mean values are 0.5936, 0.4001, and 0.3799, respectively. (C) is for number of triangles, and the mean values are 0.0692, 0.0458, and 0.0392, respectively. (D) is for phylogenetic profile similarity, and the mean values are 0.8701, 0.8243, and 0.8244, respectively. (E) is for domain interaction strength, and the mean values are 0.0141, 0.0075, and 0.0060, respectively. In each of the five graphs, the X-axis represents the interaction types, and the Y-axis represents the normalised values of the corresponding attribute. All measurements are scaled to [0, 1]. The boxes in (C), (D) and (E) appear compressed because of outliers in the normalisation procedure.

2.3. Essential protein prediction by essential PPI

After calculating the Integrated Edge Weight for each edge in a PPI network, we sorted the edges in descending order of weight. The edge list can be written as $[e_1, e_2, e_3, \ldots, e_m]$, with $\mathrm{IEW}(e_1) \geq \mathrm{IEW}(e_2) \geq \mathrm{IEW}(e_3) \geq \ldots \geq \mathrm{IEW}(e_m)$, where $\mathrm{IEW}(e_i)$ is the weight of edge i and m is the number of edges. From the edge list, we can determine the order of any two proteins. Take proteins u and v as an example: protein u precedes protein v only if the edge in which protein u first appears has a weight greater than or equal to that of the edge in which protein v first appears. In this way, IEW obtains its essential protein list. We predict essential proteins in this way, instead of scoring each protein, because the main idea is to rank the edges rather than the nodes. Consequently, there is no need to quantify the essentiality of individual proteins.

2.4. Prediction results evaluation

We compared the performance of IEW against NC, PeC, EW, the degree method [3], and a pure random method on the S. cerevisiae, E. coli, and C. elegans datasets. We chose NC and PeC because they are acknowledged as state-of-the-art methods that perform better than the previously published centrality-based measures [9,10], and EW because it is also a recently published state-of-the-art integrated method [13]. We chose the degree method as a representative of the centrality-based methods, and the pure random method as a baseline control. To evaluate each method's overall performance, we used the precision–recall curve and the Jackknife curve as presented in [26], which measures the number of true positives among the top ranked list. Finally, we analysed pathway enrichment by using DAVID [27] to examine the biological significance of the modules composed of the top-ranked interactions.

2.4.1. Precision–recall curve

A precision–recall (PR) curve for a method to be compared is defined as follows:

$$\mathrm{Precision}(n) = \frac{TP(n)}{TP(n) + FP(n)}, \tag{8}$$

$$\mathrm{Recall}(n) = \frac{TP(n)}{P}, \tag{9}$$

where TP(n) is the number of true positives among the top n ranked proteins, FP(n) is the number of non-essential proteins among the top n ranked proteins, and P is the total number of essential proteins under consideration.

2.4.2. Jackknife curve

A Jackknife curve represents the number of samples that are correctly predicted among a top ranked prediction list, denoted as Jackknife(n) for the number of true positives among the top n predictions. In a 2D representation, the x-axis denotes the top x proteins ranked by IEW, NC, and PeC, respectively, and the y-axis is the number of essential proteins correctly predicted among the top x proteins.
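The two curves defined above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function names are hypothetical, and it assumes every ranked protein is labelled either essential or non-essential, so that TP(n) + FP(n) = n (proteins of unknown status would need to be excluded first).

```python
from typing import Iterable, List, Set, Tuple


def jackknife_curve(ranked: Iterable[str], essential: Set[str]) -> List[int]:
    """Jackknife(n): number of true positives among the top n ranked proteins."""
    curve, hits = [], 0
    for protein in ranked:
        hits += protein in essential  # True counts as 1, False as 0
        curve.append(hits)
    return curve


def precision_recall(ranked: List[str], essential: Set[str]) -> List[Tuple[float, float]]:
    """(Precision(n), Recall(n)) for n = 1..len(ranked), following Eqs. (8)-(9).

    Assumption: all ranked proteins are labelled, so TP(n) + FP(n) = n.
    """
    total_positives = len(essential)  # P in Eq. (9)
    return [
        (tp / n, tp / total_positives)
        for n, tp in enumerate(jackknife_curve(ranked, essential), start=1)
    ]


# Toy ranking with hypothetical protein names.
ranked = ["A", "B", "C", "D"]
essential = {"A", "C"}
print(jackknife_curve(ranked, essential))   # [1, 1, 2, 2]
print(precision_recall(ranked, essential)[0])  # (1.0, 0.5)
```

Because the Jackknife curve is the cumulative true-positive count, the PR points follow from it directly, which is why one function reuses the other.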


Fig. 4. The comparison results on the S. cerevisiae data. (A) is the comparison of IEW, NC, PeC, EW, pure degree, and random using the PR curves. IEW shows the best overall performance. (B1) is the ability of the three methods to identify the essential proteins among their top 120 (10% of 1111 essential proteins) ranked predictions. The top prediction results have more significance than the overall performance, because biologists can only verify the essentiality of a small amount of proteins by wet-lab experiments. (B2) shows the overall performance of all methods represented by the Jackknife curves. The performance of NC and PeC is very similar, so the Jackknife curves of these two approaches largely overlap.

2.4.3. Pathway enrichment analysis

We carried out pathway enrichment analysis on several sub-networks by using DAVID, with statistical significance p-values calculated by the modified Fisher's exact test [27]. These sub-networks are constructed from different numbers of the top ranked IEW edges. To correct the enrichment p-values and control the family-wide false discovery rate (FDR), DAVID applied a Benjamini–Hochberg (BH) [28] testing correction.

3. Material

3.1. Essential protein list

The lists of essential genes of all three model organisms (S. cerevisiae, E. coli, and C. elegans) were collected from the OGEE [29] database, in which all genes are grouped into three categories: essential, non-essential, and conditional. In this research we considered conditional genes as essential genes, since our method tries to identify essential proteins under any condition. To maintain consistency across the different data sources, we could not use the entire essential list. For S. cerevisiae, we selected 1111 essential proteins, 2673 nonessential proteins, and 64 unknown proteins that are in the DIP database [30] but not in OGEE. For E. coli, we obtained 458 essential proteins and 1371 nonessential proteins as our E. coli dataset. For C. elegans, we obtained 150 essential proteins, 422 nonessential proteins, and 270 unknown proteins from the OGEE database.

3.2. Source of measures

3.2.1. Number of triangles data

The number of triangles is obtained from PPI information. The PPI data of S. cerevisiae and E. coli were collected from the DIP [30] database released on Jan. 17th, 2014. This database consists of 22,637 distinct interactions among 5068 proteins for S. cerevisiae, 12,198 distinct interactions among 2904 proteins for E. coli, and 4118 distinct interactions among 2708 proteins for C. elegans.

3.2.2. Gene expression data

The gene expression data of S. cerevisiae were collected from the GEO [31] database. We chose the dataset GSE3431 [32], which was collected at 12 time points during three successive metabolic cycles. The dataset contains 36 samples with 6777 genes. The E. coli gene expression data GSE6425 [33] were also downloaded from GEO. These transcriptional expression data were measured at various time points during aerobic or anaerobic growth in the Luria–Bertani medium. There are 44 samples with 4345 genes in GSE6425. We used GSE6547 [34] as the gene expression data for C. elegans. The transcriptome profiles of lin-35 mutants and wild-type worms were collected at 3 developmental stages (embryonic, L1, and L4). There are 18 samples, of which we used the 9 from wild-type worms.

Fig. 5. The comparison results on the E. coli data. (A) is the comparison of IEW, NC, PeC, EW, pure degree, and random using the PR curve. IEW shows a significant performance improvement when the value of recall is less than 0.1. (B1) is the ability of the three methods to identify the essential proteins among their top 50 (10% of 458 essential proteins) ranked predictions. (B2) shows the overall performance of all methods represented by the Jackknife curves. It is noted that the curves for NC and PeC largely overlap.

3.2.3. GO data

The Gene Ontology and Annotation for S. cerevisiae, E. coli and C. elegans were all obtained from the Gene Ontology Consortium website [17]. The data on the website are updated almost daily. The annotation data for S. cerevisiae (released on July 13th, 2014) contain 94,181 annotations; the annotation data for E. coli (released on June 26th, 2014) contain 45,978 annotations; and the annotation data for C. elegans (released on June 4th, 2014) contain 20,054 annotations. As to the Ontology data, the relationships "part of", "has part", and "regulates" were ignored for simplicity. We only considered the "is a" relationship, which leaves 39,446 terms and 66,193 edges.

3.2.4. Domain data

We obtained the domain information of a particular protein by using the InterProScan5 [23] program. The domain-domain interaction weights were obtained from the DOMINE database (version 2.0) [35]. DOMINE is a database of known and predicted protein domain (domain-domain) interactions. It contains interactions inferred from experiments, and it also contains interactions predicted by 13 different computational approaches. DOMINE contains a total of 26,219 domain-domain interactions, out of which 6634 are inferred from experiments, 2989 are high-confidence predictions, 2537 are medium-confidence predictions, and the remaining 16,094 are low-confidence predictions. We used these four confidence levels to weight the domain-domain interactions as 4, 3, 2 and 1, respectively.
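The domain score of Eq. (5), combined with the 4/3/2/1 confidence weights above, can be sketched as follows. This is an illustrative sketch, not the authors' code: the confidence labels, the lookup-table layout and the domain identifiers are hypothetical, and it assumes that domain pairs absent from the table contribute zero.

```python
from collections import Counter
from typing import Dict, FrozenSet, List

# Weights for the four DOMINE confidence levels described in Section 3.2.4.
# The string labels are hypothetical names chosen for this sketch.
CONFIDENCE_WEIGHT = {"experimental": 4, "high": 3, "medium": 2, "low": 1}


def domain_score(
    domains_u: List[str],
    domains_v: List[str],
    ddi_confidence: Dict[FrozenSet[str], str],
) -> int:
    """DS(u, v) = sum over domain pairs (i, j) of W_ij * D_i * D_j (Eq. 5)."""
    count_u, count_v = Counter(domains_u), Counter(domains_v)  # D_i and D_j
    score = 0
    for i, d_i in count_u.items():
        for j, d_j in count_v.items():
            level = ddi_confidence.get(frozenset((i, j)))
            if level is not None:  # assumption: unknown pairs contribute 0
                score += CONFIDENCE_WEIGHT[level] * d_i * d_j
    return score


# Hypothetical proteins: u carries domain "dom1" twice and "dom2" once.
ddi = {
    frozenset(("dom1", "dom3")): "experimental",
    frozenset(("dom2", "dom3")): "low",
}
print(domain_score(["dom1", "dom1", "dom2"], ["dom3"], ddi))  # 4*2*1 + 1*1*1 = 9
```

Keying the table by an unordered pair means the score is symmetric in u and v, which matches the undirected PPI network.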

3.2.5. Phylogenetic profile data

We collected the phylogenetic profile data from the InParanoid database (Version 8.0) [36], which contains ortholog information for 273 organisms. The InParanoid program uses the pairwise similarity scores (calculated using NCBI-Blast [37]) between two complete proteomes to construct orthology groups. An orthology group is initially composed of two orthologs, called seed orthologs, which are found by the two-way best hits between two proteomes. More sequences are added to the group if there are sequences in the two proteomes that are closer to the corresponding seed ortholog than to any sequence in the other proteome. These members of an orthology group are called inparalogs. A confidence value is provided for each inparalog that shows how closely related it is to its seed ortholog. In this research, we only used seed ortholog pairs whose bootstrap value is 100%, for the best accuracy.
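Given such presence/absence profiles, the phylogenetic profile similarity of Section 2.1.5 can be sketched as below. The sketch reads Eq. (6) as the Pearson correlation of two binary profiles, PPS = (Nz − n_u n_v) / sqrt((N n_u − n_u²)(N n_v − n_v²)); the function name, the 0/1 list encoding, and the choice to return 0.0 for a constant profile are assumptions made for illustration.

```python
from math import sqrt
from typing import Sequence


def phylogenetic_profile_similarity(
    profile_u: Sequence[int], profile_v: Sequence[int]
) -> float:
    """PPS of two binary profiles (1 = ortholog present in that genome)."""
    n_genomes = len(profile_u)  # N
    if len(profile_v) != n_genomes:
        raise ValueError("profiles must cover the same genomes")
    n_u = sum(profile_u)  # genomes containing an ortholog of u
    n_v = sum(profile_v)  # genomes containing an ortholog of v
    z = sum(a * b for a, b in zip(profile_u, profile_v))  # co-occurrences
    denominator = sqrt(
        (n_genomes * n_u - n_u ** 2) * (n_genomes * n_v - n_v ** 2)
    )
    if denominator == 0:
        # Assumption: a protein present in all or none of the genomes carries
        # no co-occurrence signal, so the similarity is reported as 0.
        return 0.0
    return (n_genomes * z - n_u * n_v) / denominator


print(phylogenetic_profile_similarity([1, 0, 1, 0], [1, 0, 1, 0]))  # 1.0
print(phylogenetic_profile_similarity([1, 0, 1, 0], [0, 1, 0, 1]))  # -1.0
```

The raw value lies in [-1, 1]; Section 2.2 states that PPS is subsequently normalised to [0, 1] before integration.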


Fig. 6. The comparison results on the C. elegans data. (A) is the comparison of IEW, NC, PeC, EW, pure degree, and random using the PR curve. (B1) is the ability of the three methods to identify the essential proteins among their top 150 (this number equals 100% of the essential proteins of C. elegans) ranked predictions. (B2) shows the overall performance of all methods represented by the Jackknife curves. Again, the curves for NC and PeC largely overlap.

3.3. A summary of IEW

3.3.1. Input of IEW

(1) Protein–protein interaction data.
(2) Expression data for each protein among several samples.
(3) GO terms for each protein.
(4) Domain information for each protein.
(5) Phylogenetic profile for each protein among several organisms.

3.3.2. Output of IEW

Essential proteins predicted by the sorted essential protein–protein interactions.

3.3.3. The workflow of IEW

See Fig. 1.

3.3.4. Tools and databases used in IEW

(1) GO database [17].
(2) FastSemSim [20].
(3) InterProScan5 [23].
(4) The Database for Annotation, Visualisation and Integrated Discovery (DAVID) [27].
(5) OGEE database [29].
(6) DIP database [30].
(7) GEO database [31].
(8) DOMINE database [35].
(9) Inparanoid database [36].
(10) KEGG database [38].
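To tie the summary together, here is a minimal end-to-end sketch of the workflow on a toy network: it computes NTE from the edge list (Eq. 1), scales it, integrates five measures into an edge weight (Eq. 7), and orders proteins by the first edge in which they appear (Section 2.3). It is not the authors' implementation. The function names are hypothetical, the other four measures are made-up numbers assumed to be already scaled to [0, 1], and dividing by the maximum is only one possible reading of the normalisation step, which the text does not specify.

```python
from math import sqrt
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def nte(edges: List[Edge]) -> Dict[Edge, int]:
    """NTE(u, v) = |C_u ∩ C_v| + 1 for every edge (Eq. 1)."""
    neighbours: Dict[str, Set[str]] = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    return {(u, v): len(neighbours[u] & neighbours[v]) + 1 for u, v in edges}


def scale_by_max(values: Dict[Edge, float]) -> Dict[Edge, float]:
    """Assumed normalisation: divide by the maximum (NTE >= 1, so it is non-zero)."""
    top = max(values.values())
    return {edge: value / top for edge, value in values.items()}


def iew(measures: List[float]) -> float:
    """Projection of the measure vector onto the all-ones vector (Eq. 7)."""
    return sum(measures) / sqrt(len(measures))


def rank_proteins(edge_weights: Dict[Edge, float]) -> List[str]:
    """Order proteins by the highest-weighted edge in which they first appear."""
    ranked: List[str] = []
    seen: Set[str] = set()
    by_weight = sorted(edge_weights.items(), key=lambda kv: kv[1], reverse=True)
    for (u, v), _ in by_weight:
        for protein in (u, v):  # order within one edge is arbitrary
            if protein not in seen:
                seen.add(protein)
                ranked.append(protein)
    return ranked


# Toy network; GES, GOS, DS and PPS values below are hypothetical.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
other_measures = {
    ("A", "B"): [0.9, 0.8, 0.5, 0.9],
    ("B", "C"): [0.2, 0.3, 0.0, 0.8],
    ("A", "C"): [0.5, 0.5, 0.1, 0.8],
    ("C", "D"): [0.1, 0.2, 0.0, 0.8],
}
nte_scaled = scale_by_max(nte(edges))
weights = {e: iew([nte_scaled[e]] + other_measures[e]) for e in edges}
ranking = rank_proteins(weights)
print(ranking)  # ['A', 'B', 'C', 'D']
```

Note that proteins are never scored individually: the ranking is read off the sorted edge list, which mirrors the edge-centred idea of the method.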

4. Results and discussion

4.1. Performance on the three datasets

Before comparing performance with the other methods, we first performed a preliminary experiment on S. cerevisiae to analyse the effectiveness of each of the five measurements that IEW uses. The experimental results are shown in Fig. 3. From Fig. 3 and the corresponding values of each attribute, we can see that all five selected measurements contribute to the prediction ability of IEW to different extents. It seems natural to assign different weights to the attributes according to their contributions. However, as the geometric model above illustrates, we assigned equal weight to each measurement to prevent any single attribute from biasing the result. In this way, the success or failure of a particular attribute is less likely to determine the final integrated weight. This also illustrates the rationale of the attribute selection strategy.


Fig. 7. The robustness analysis by comparing the effect of PPI network damaged by deleting edges randomly. For S. cerevisiae and E. coli datasets, IEW has the minimum fluctuation compared with the other two, which means that deleting edges randomly has less influence on the performance of IEW than that of the other two.

Fig. 8. The robustness analysis by comparing the effect of PPI network damaged by deleting top-ranked IEW edges. The performance of all three methods drops. This indicates that top-ranked IEW edges are also important for NC and PeC.
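The network perturbations named in these captions (deleting edges at random, or deleting the top-ranked or bottom-ranked IEW edges) can be sketched as follows. This is only an illustration of the idea: the function names, the fixed random seed, and the use of a deletion fraction are assumptions, since the excerpt does not state how many edges were removed or how the deletion was implemented.

```python
import random
from typing import List, Tuple

Edge = Tuple[str, str]


def delete_random(edges: List[Edge], fraction: float, seed: int = 0) -> List[Edge]:
    """Remove a random fraction of edges, keeping the original order."""
    rng = random.Random(seed)  # fixed seed so the perturbation is reproducible
    n_keep = len(edges) - int(round(fraction * len(edges)))
    kept = set(rng.sample(range(len(edges)), n_keep))
    return [edge for index, edge in enumerate(edges) if index in kept]


def delete_top_ranked(edges_by_weight: List[Edge], fraction: float) -> List[Edge]:
    """Remove the highest-weighted edges; input must be sorted by descending IEW."""
    n_drop = int(round(fraction * len(edges_by_weight)))
    return edges_by_weight[n_drop:]


def delete_bottom_ranked(edges_by_weight: List[Edge], fraction: float) -> List[Edge]:
    """Remove the lowest-weighted edges; input must be sorted by descending IEW."""
    n_drop = int(round(fraction * len(edges_by_weight)))
    return edges_by_weight[: len(edges_by_weight) - n_drop]


# Hypothetical chain network of 10 edges, treated as already sorted by IEW.
toy_edges = [(str(i), str(i + 1)) for i in range(10)]
print(len(delete_random(toy_edges, 0.2)))  # 8
```

Each perturbed edge list would then be passed through the prediction method again, and the resulting curves compared with those of the intact network.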


For each attribute in Fig. 3, we sampled randomly for 10,000 times to calculated the statistical significance of the values’ difference between different interaction types. All the calculated p-values are all below 104. 4.1.1. Evaluation results on S. cerevisiae datasets In all the graphs showing comparison of results (Figs. 4–6), we presented all the six methods (IEW, NC, PeC, EW, Degree, and Random) in the PR curve, but only presented three (IEW, NC, and PeC) of them in the Jack-knife graphs. This is because EW is also a link-weight based integrated method, and we just took IEW, whose performance is better, as a representative of this kind. As to the Degree and Random methods, their poor performance makes them far from comparable. We first compared these methods using the PR curve on the S. cerevisiae datasets. The IEW shows a significant improvement, except that there is a small performance drop when the value of recall is around 0.3 (Fig. 4A). Then we evaluated the three methods in terms of the numbers of essential proteins among their individual top-ranked proteins (Fig. 4B2), and found that IEW performs substantially better than the other two methods (the curves for NC and PeC are almost identical). Specifically, compared to the overall performance, people are more interested in the top ranked predictions, so we checked their ability to successfully identify essential proteins among their top 120 ranked predictions (Fig. 4B1). 4.1.2. Evaluation results on E. coli datasets Following the same procedure, the comparison results of the six methods on E. coli datasets are shown in Fig. 5A. From this figure we see that when the value of recall is small, IEW achieves the best


performance. However, we also notice that when the recall value is between 0.1 and 0.2, IEW does not perform as well as the others. Once again, we tested the three methods' prediction ability in terms of the number of essential proteins among their top-ranked proteins (Fig. 5B1). IEW is still superior to the other two, both on the top 50 ranked proteins and on the whole dataset (Fig. 5B2).

4.1.3. Evaluation results on C. elegans datasets

Finally, we evaluated the methods on the C. elegans datasets. Because the PPI data of C. elegans are sparse, none of the six methods performed as well as on the other two datasets. Still, IEW shows a better overall result than the others, as shown in Fig. 6A. We can see that the performance of NC and PeC is almost the same as random prediction. We conjecture that this is because these two methods mainly depend on the ECC (derived from PPI data). This also indicates that IEW can overcome the single-attribute-dependence problem, because the other attributes that IEW uses can compensate for the poor predictive power of any single attribute. This also motivated us to test the robustness of IEW, which is presented in Section 4.2. Finally, we performed the Jackknife test for the three methods (Fig. 6B). IEW again shows a significant improvement over the other two.

4.2. Robustness and generalisation analysis

Robustness is an important feature of a method as it reflects the method's practical applicability. Some prediction methods work well on simulated datasets but not on real ones. This is because researchers tend to use well-formed data when they design an algorithm, whereas in practice data are often affected by defects that decrease their quality. In the case of NC

Fig. 9. The robustness analysis by comparing the effect of PPI network damaged by deleting bottom-ranked IEW edges. For S. cerevisiae and E. coli datasets, IEW is almost unaffected by deleting bottom-ranked IEW edges (almost no fluctuation), while NC and PeC achieve better results.


and PeC, the quality of the PPI data in the DIP database marred the performance of the two methods. However, integrated methods like IEW have the chance to be more robust, since the various data sources may compensate for each other's deficiencies. In other words, the poor prediction performance of one specific attribute need not result in poor performance of the final prediction.

To test the robustness of IEW, we carried out a series of experiments. We deliberately damaged the PPI network from the DIP database and then compared the influence of the damage on the performance of IEW, NC, and PeC. The results are shown in Figs. 7–9. We first damaged the PPI network by deleting edges randomly. The results are shown in Fig. 7, where IEW is much steadier than the other two methods on the S. cerevisiae and E. coli datasets. We hypothesise that this is because only a small portion of PPIs are truly essential, and IEW can identify them correctly, whereas NC and PeC treat a larger portion of PPIs as important. In other words, NC and PeC have a high false positive rate, which is why their performance dropped dramatically when we deleted edges randomly.

To test our hypothesis, we deleted the top-ranked IEW edges and compared the performance of the three algorithms on the regenerated network (Fig. 8). The performance of all three methods on S. cerevisiae and E. coli dropped (red line higher than the others), which means that the top-ranked IEW edges are important for each method and are very likely to be the true essential PPIs. Then, we deleted the bottom-ranked IEW edges; the results are shown in Fig. 9. In this figure, the IEW results do not change significantly, while the performance of NC and PeC improves steadily (red line lower than the others) as more bottom-ranked IEW edges are removed.
This indicates that many non-essential PPIs were considered important by NC and PeC, so these two algorithms cannot achieve high prediction accuracy. When these interactions are deleted, NC and PeC obtain better results. As for the results on the C. elegans datasets in Figs. 7 and 9, since the PPI data are of low quality, deleting edges appears to make little difference. The results presented in Figs. 7–9 indicate that IEW is more robust than NC and PeC, which demonstrates the advantage of integrated methods in overcoming the single-attribute-dependence problem. Since we deleted edges following the same rule

Table 1. The DAVID enrichment analysis result on the specific module example for each organism.

Organism        GO term                                        Relevant/Total (%)   p-value   Benjamini
S. cerevisiae   Nucleolus                                      19/20 (95%)          6.8E-22   2.1E-20
E. coli         Ribosome                                       26/38 (68%)          1.0E-50   1.3E-48
C. elegans      Intracellular non-membrane-bounded organelle   6/10 (60%)           7.9E-6    2.4E-4

for the three methods, which means that all three methods were run on the same network, the differences in the results should be attributed to the robustness of the methods rather than to the robustness of the network.

Generalisation ability is another important criterion for evaluating an algorithm. Although all three methods (IEW, NC, and PeC) generalise well on the datasets of the three model organisms, IEW achieves the best performance according to the results in Figs. 4–6.

4.3. Biological significance examination

To examine the biological significance of the high-ranked interactions obtained from IEW, we selected the top 100 ranked interactions from each organism to perform the DAVID enrichment analysis [27]. Furthermore, with the guidance of researchers with a biological background, we also mapped the proteins linked by high-ranked interactions to the organisms' pathways in KEGG [38] and evaluated their essentiality in metabolism. We observed that the top 100 ranked interactions tended to form many closely connected modules, and we selected only the biggest one as a case study for each organism. The top 100 ranked interactions and the modules they form are shown in Fig. 10.

4.3.1. Case study on S. cerevisiae

As we can see from Fig. 10, the specific module in S. cerevisiae contains 20 proteins, 18 of which are recognised as essential in the OGEE database. We conducted a literature search using these known essential proteins and found that most of them (12 out of 18) are associated with nascent 60S ribosomal particles [39].

Fig. 10. The modules formed by the top 100 ranked interactions by IEW in the three datasets. The specific modules for case study are zoomed in for a closer examination. Red circles are the essential proteins that we predicted as essential. Green circles are the non-essential proteins that we predicted as essential. (A) S. cerevisiae network, and the module we study consists of 20 proteins. (B) E. coli network, and the biggest module consists of 38 proteins. (C) C. elegans network. Although its modularization is not obvious, we select the biggest module as a case study to perform analysis, and this module contains only 10 proteins.
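The module selection behind Fig. 10 (take the top-ranked interactions, see which modules they form, and keep the biggest one) can be sketched in Python as follows. This is a minimal sketch, assuming that a "module" is a connected component of the sub-network induced by the selected edges; the function names and that definition are illustrative assumptions and are not taken from the authors' implementation.

```python
from collections import defaultdict

def top_k_edges(weights, k):
    """Return the k highest-weighted edges; ties are broken by edge name."""
    return sorted(weights, key=lambda e: (-weights[e], e))[:k]

def modules(edges):
    """Connected components of the sub-network, largest first."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        # Depth-first traversal collecting every node reachable from start.
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return sorted(comps, key=len, reverse=True)

def essential_fraction(module, essential):
    """Share of a module's proteins that are known to be essential."""
    return len(module & essential) / len(module)
```

With edge weights keyed by protein pairs, `modules(top_k_edges(weights, 100))[0]` would give the biggest module, and `essential_fraction` would give a figure comparable to the counts quoted in the case studies (for example, 18 essential proteins out of 20).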


Fig. 11. The connecting tendency in IEW method. To check if connecting tendency was reflected in the IEW method, we used essentiality-known proteins to calculate the IEW values between different interaction styles (essential–essential, essential–nonessential, and nonessential–nonessential interactions). (A), (B), (C) are the results in S. cerevisiae, E. coli and C. elegans, respectively.
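The comparison summarised in Fig. 11 amounts to grouping edges by the essentiality of their two endpoints and comparing the edge weights of each group. The Python sketch below illustrates this under stated assumptions: the function names are hypothetical, the group mean is used as the summary statistic, and the label-shuffling test is one common way to attach a p-value to such a difference; the text does not specify the exact resampling procedure, so it should not be read as the authors' method.

```python
import random

TYPES = ("nonessential-nonessential", "essential-nonessential",
         "essential-essential")

def edge_type(u, v, essential):
    """Classify an edge by how many of its endpoints are essential."""
    return TYPES[(u in essential) + (v in essential)]

def mean_weight_by_type(weights, essential):
    """Mean edge weight for each interaction type present in the data."""
    groups = {}
    for (u, v), w in weights.items():
        groups.setdefault(edge_type(u, v, essential), []).append(w)
    return {t: sum(ws) / len(ws) for t, ws in groups.items()}

def permutation_p_value(xs, ys, n_perm=10000, seed=0):
    """One-sided p-value for mean(xs) > mean(ys), by shuffling group labels."""
    rng = random.Random(seed)
    n = len(xs)
    observed = sum(xs) / n - sum(ys) / len(ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:n]) / n - sum(pooled[n:]) / len(ys)
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

With 10,000 shuffles the smallest p-value this estimator can return is about 10^-4, so any finer resolution would require more resamples.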

In that article, the authors showed that the association of preribosomal factors with pre-60S complexes depends on the presence of earlier factors, a phenomenon essential for ribosome biogenesis [39]. After mapping these proteins to the S. cerevisiae pathway network, we found that they were enriched in three pathways: RNA polymerase, Pyrimidine metabolism, and Purine metabolism. This means that they are crucial to the downstream RNA synthesis process.

4.3.2. Case study on E. coli

Among the 29 essential proteins in this specific module of E. coli, 23 proteins were observed in ribosomes with good sensitivity and high accuracy using mass spectrometry [40]. Thus, these proteins truly contribute to ribosome synthesis. In addition, the pathway analysis revealed that all these 23 proteins can be mapped to the "Ribosome" pathway.

4.3.3. Case study on C. elegans

In the selected module of C. elegans, essential proteins are not as dense as in the other two organisms. However, it was reported in [41] that 3 out of the 5 essential proteins in this module are required for the first two rounds of cell division in the C. elegans embryo. The pathway analysis yielded no enrichment, but according to the GO enrichment results, these proteins are related to the non-membrane-bounded organelle. It is plausible that non-membrane-bounded organelles are essential during cell division. All the DAVID reports are summarised in Table 1. More detailed data about the top 100 ranked PPIs are provided in the Supplementary materials.

4.4. Rich-club phenomena

We have also paid attention to the connection tendency, which is one of the rich-club phenomena [42,43] of essential proteins in PPI networks. IEW was proposed based on this idea, and essential proteins are found effectively by using IEW. Thus, it is natural to verify whether the tendency is truly reflected in IEW by using already known essential proteins (Fig. 11). From Fig. 11, we can see that in each of the three datasets, essential–essential interactions have a higher IEW value than the other two interaction styles. Also, there are some other interesting properties of essential and non-essential proteins. Overall, essential proteins tend to be distributed in large clusters of sub-graphs (which are the simplified networks built from a portion of the edges ranked by IEW value, such

as Fig. 10A and B), whereas non-essential proteins tend to be distributed in small clusters of a couple of nodes (Fig. 10C). This can be explained by the rich-club phenomena in PPI networks. Although essential proteins seem to be scattered in C. elegans, the tendency still holds: half of the nodes in the giant cluster are essential, whereas the remaining clusters mostly consist of non-essential ones. Overall, these observations are consistent with the conclusions in [42,43].

5. Conclusions

In this research, through the integration of five measurements from different perspectives, we proposed a novel method called IEW to predict essential proteins. IEW is fundamentally different from other methods because it ranks protein–protein interactions according to their weights calculated from the five measurements and then predicts essential proteins connected by the top-ranked edges, rather than identifying essential proteins directly based on centrality-lethality rules as in other approaches. We first performed a preliminary experiment showing that each of the five measurements has a different ability to distinguish essential links from other kinds of interactions. Then we demonstrated that IEW performed better than other state-of-the-art methods through the results presented by the PR and Jackknife curves on all three datasets. In particular, when the available data cannot provide enough information for one specific attribute to support effective identification, IEW can still produce relatively high prediction accuracy, which demonstrates the robustness of the proposed integrated strategy. Finally, we note that the calculation of IEW also yields sub-networks of a PPI network, and such sub-networks may imply important biological functions that are worthy of further investigation. Furthermore, we believe that in the future IEW can be used as a powerful tool for essential protein identification in a wide range of organisms, provided that enough information for these organisms is available.

Acknowledgments

The first author would like to thank Duolin Wang at Jilin University and Dr. Trupti at the University of Missouri for their inspiring discussion during the development of the model and the biological analysis. This work was supported by the Natural


Science Foundation of China (Grant Nos. 61272207, 61402194, 61472159) and Development Project of Jilin Province of China (Grant No. 20140101180JC).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.ymeth.2015.04.013.

References

[1] G. Giaever, A.M. Chu, L. Ni, C. Connelly, L. Riles, S. Véronneau, S. Dow, A. Lucau-Danila, K. Anderson, B. André, Nature 418 (2002) 387–391.
[2] T. Roemer, B. Jiang, J. Davison, T. Ketela, K. Veillette, A. Breton, F. Tandia, A. Linteau, S. Sillaots, C. Marta, N. Martel, S. Veronneau, S. Lemieux, S. Kauffman, J. Becker, R. Storms, C. Boone, H. Bussey, Mol. Microbiol. 50 (2003) 167–181.
[3] H. Jeong, S.P. Mason, A.L. Barabasi, Z.N. Oltvai, Nature 411 (2001) 41–42.
[4] M.P. Joy, A. Brock, D.E. Ingber, S. Huang, J. Biomed. Biotechnol. 2005 (2005) 96–103.
[5] S. Wuchty, P.F. Stadler, J. Theor. Biol. 223 (2003) 45–53.
[6] E. Estrada, J.A. Rodriguez-Velazquez, Phys. Rev. E: Stat. Nonlinear Soft Matter Phys. 71 (2005) 056103.
[7] P. Bonacich, Am. J. Sociol. (1987) 1170–1182.
[8] N. Przulj, D.A. Wigle, I. Jurisica, Bioinformatics 20 (2004) 340–348.
[9] J.X. Wang, M. Li, H. Wang, Y. Pan, IEEE ACM T Comput. Biol. 9 (2012) 1070–1080.
[10] M. Li, H.H. Zhang, J.X. Wang, Y. Pan, BMC Syst. Biol. 6 (2012).
[11] J.B. Pereira-Leal, B. Audit, J.M. Peregrin-Alvarez, C.A. Ouzounis, Mol. Biol. Evol. 22 (2005) 1157–1157.
[12] X.L. He, J.Z. Zhang, PLoS Genet. 2 (2006) 826–834.
[13] Y. Wang, H. Sun, W. Du, E. Blanzieri, G. Viero, Y. Xu, Y. Liang, PLoS ONE 9 (2014) e108716.
[14] E. Estrada, Proteomics 6 (2006) 35–40.
[15] S.L. Carter, C.M. Brechbuhler, M. Griffin, A.T. Bond, Bioinformatics 20 (2004) 2242–2250.
[16] C. von Mering, R. Krause, B. Snel, M. Cornell, S.G. Oliver, S. Fields, P. Bork, Nature 417 (2002) 399–403.
[17] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, M.A. Harris, D.P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin, G. Sherlock, G.O. Consortium, Nat. Genet. 25 (2000) 25–29.
[18] N.J. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo, A. Ignatchenko, J. Li, S. Pu, N. Datta, A.P. Tikuisis, Nature 440 (2006) 637–643.
[19] D. Lin, ICML 1998, pp. 296–304.
[20] M.M. FastSemSim. unpublished. Last accessed on Aug 1st, 2014.
[21] T. Pawson, P. Nash, Science 300 (2003) 445–452.
[22] J.R. Bock, D.A. Gough, Bioinformatics 17 (2001) 455–460.
[23] P. Jones, D. Binns, H.Y. Chang, M. Fraser, W. Li, C. McAnulla, H. McWilliam, J. Maslen, A. Mitchell, G. Nuka, S. Pesseat, A.F. Quinn, A. Sangrador-Vegas, M. Scheremetjew, S.Y. Yong, R. Lopez, S. Hunter, Bioinformatics 30 (2014) 1236–1240.
[24] M. Pellegrini, E.M. Marcotte, M.J. Thompson, D. Eisenberg, T.O. Yeates, Proc. Natl. Acad. Sci. U.S.A. 96 (1999) 4285–4288.
[25] L.F. Chen, D. Vitkup, Genome Biol. 7 (2006).
[26] A.G. Holman, P.J. Davis, J.M. Foster, C.K. Carlow, S. Kumar, BMC Microbiol. 9 (2009) 243.
[27] D.W. Huang, B.T. Sherman, R.A. Lempicki, Nat. Protoc. 4 (2009) 44–57.
[28] Y. Benjamini, Y. Hochberg, J. Roy. Stat. Soc.: Ser. B (Methodol.) (1995) 289–300.
[29] W.H. Chen, P. Minguez, M.J. Lercher, P. Bork, Nucleic Acids Res. 40 (2012) D901–D906.
[30] I. Xenarios, L. Salwinski, X.Q.J. Duan, P. Higney, S.M. Kim, D. Eisenberg, Nucleic Acids Res. 30 (2002) 303–305.
[31] GEO. Last accessed on July 28th, 2014.
[32] B.P. Tu, A. Kudlicki, M. Rowicka, S.L. McKnight, Science 310 (2005) 1152–1158.
[33] C.S. Reigstad, S.J. Hultgren, J.I. Gordon, J. Biol. Chem. 282 (2007) 21259–21267.
[34] N.V. Kirienko, D.S. Fay, Dev. Biol. 305 (2007) 674–684.
[35] S. Yellaboina, A. Tasneem, D.V. Zaykin, B. Raghavachari, R. Jothi, Nucleic Acids Res. 39 (2011) D730–735.
[36] K.P. O'Brien, M. Remm, E.L. Sonnhammer, Nucleic Acids Res. 33 (2005) D476–480.
[37] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, J. Mol. Biol. 215 (1990) 403–410.
[38] M. Kanehisa, S. Goto, Nucleic Acids Res. 28 (2000) 27–30.
[39] C. Saveanu, A. Namane, P.E. Gleizes, A. Lebreton, J.C. Rousselle, J. Noaillac-Depeyre, N. Gas, A. Jacquier, M. Fromont-Racine, Mol. Cell. Biol. 23 (2003) 4449–4460.
[40] R.J. Arnold, J.P. Reilly, Anal. Biochem. 269 (1999) 105–112.
[41] B. Sonnichsen, L.B. Koski, A. Walsh, P. Marschall, B. Neumann, M. Brehm, A.M. Alleaume, J. Artelt, P. Bettencourt, E. Cassin, M. Hewitson, C. Holz, M. Khan, S. Lazik, C. Martin, B. Nitzsche, M. Ruer, J. Stamford, M. Winzi, R. Heinkel, M. Roder, J. Finell, H. Hantsch, S.J.M. Jones, M. Jones, F. Piano, K.C. Gunsalus, K. Oegema, P. Gonczy, A. Coulson, A.A. Hyman, C.J. Echeverri, Nature 434 (2005) 462–469.
[42] M.E.J. Newman, Phys. Rev. E 67 (2003).
[43] R. Pastor-Satorras, A. Vazquez, A. Vespignani, Phys. Rev. Lett. 87 (2001).
