Journal of Bioinformatics and Computational Biology, Vol. 14, No. 3 (2016) 1650008 (34 pages) © Imperial College Press DOI: 10.1142/S0219720016500086

Prediction of protein–protein interaction network using a multi-objective optimization approach

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY on 02/05/16. For personal use only.

Archana Chowdhury*, Pratyusha Rakshit† and Amit Konar‡

Artificial Intelligence Laboratory, Department of Electronics and Telecommunication Engineering, Jadavpur University, Kolkata, India
*[email protected][email protected][email protected]

Received 10 March 2015; Revised 22 November 2015; Accepted 23 November 2015; Published 5 February 2016

Protein–Protein Interactions (PPIs) are very important as they coordinate almost all cellular processes. This paper formulates the PPI prediction problem in a multi-objective optimization framework. The scoring functions for a trial solution simultaneously maximize the functional similarity, the strength of the domain interaction profiles, and the number of common neighbors of the proteins predicted to be interacting. The optimization problem is solved using the proposed Firefly Algorithm with Nondominated Sorting. Experiments reveal that the proposed PPI prediction technique outperforms existing methods, including the gene ontology-based Relative Specific Similarity, the multi-domain-based Domain Cohesion Coupling method, the domain-based Random Decision Forest method, Bagging with REP Tree, and evolutionary/swarm algorithm-based approaches, with respect to sensitivity, specificity, and F1 score.

Keywords: Protein–protein interaction networks; gene ontology; firefly algorithm; nondominated sorting.

1. Introduction

Proteins are chains of amino acids linked together by peptide bonds. They play a vital role in organisms and participate in many processes within cells. A single protein is unable to perform most biological activities [1]; herein lies the significance of Protein–Protein Interactions (PPIs). Proteins interact with other proteins, and these interactions are important for many biological processes. There are several in vivo and in vitro methods for identifying PPIs [2]. Some experiments identify interactions on a small scale, while others, such as high-throughput methods, detect interactions on a large scale [3–6]. Many experiments provide physical interactions among proteins, while few provide functional associations. Physical interactions are mediated by the chemical structure of the two proteins, whereas functional interactions regulate the functions of the associated proteins.

High-throughput methods may produce many erroneous results in the form of false positives and false negatives. Overexpression of a pair of proteins, lack of a specific post-translational modification, and absence of regulatory proteins at the location may lead to nonspecific interactions and hence to false positives. The conditions under which high-throughput experiments are performed may not be favorable for some proteins to sustain their interaction; the resulting inability to capture transient interactions often leads to false negatives. In addition, because the number of interactions within a cell is very high, it is often difficult to capture individual interactions. Computational methods can overcome these limitations and are therefore gaining importance in bioinformatics research. Various computational approaches have been developed to predict PPIs using different characteristic features of existing PPIs [7–11].

In the present context of predicting a possible interaction between two proteins, the scoring functions are based on maximizing (1) the similarity of their functions, (2) the strength of their interacting domain–domain pairs, and (3) the number of common neighbors shared by the two proteins predicted to be interacting.

The first criterion is based on the Gene Ontology (GO) annotation of proteins. GO annotation, a unified representation of genes and gene products across all species, has been identified as one of the strongest predictors of protein interaction. GO annotation-driven interaction inference [13,14] is based on the observation that proteins localized in the same cellular compartment are more likely to interact than proteins that reside in spatially distant compartments [15]. Similarly, sharing a common biological process or molecular function has been found to be predictive of PPI [1].

The second criterion deals with protein domains. Domains, the building blocks of proteins, are conserved through evolution and represent protein functions or structures. A vast literature establishes that proteins interact with each other if the domains present in them interact with each other. A revolutionary work in this regard was proposed by Sprinzak and Margalit in Ref. 16. Approaches based on single-domain interactions in proteins include the works in Refs. 17 and 18, and a domain combination-based method is given in Ref. 19. These approaches analyzed the effect of the co-occurrence of either single domains or combinations of domains in interacting protein pairs and concluded that proteins interact because of the presence of domains that tend to interact with each other. Thus, in our approach, a protein is predicted to interact with another protein if the interacting-domain set of one protein contains domains with which the domains of the other protein can interact.

The final criterion is based on the topological features of proteins. The PPI network is characterized by several topological properties [20], and network topology has been observed to predict PPIs [21]. Several graph-based approaches represent the PPI network as a graph and extract relational and structural features from it. These include the approaches proposed in Refs. 2 and 37, which use relational data of the PPI network along with other protein features to predict PPIs. One such topological feature is the number of common neighbors shared by the proteins in the PPI network: if two proteins share a large number of common neighbors in the network, they are predicted to be interacting [26,33].

At first glance, it may seem that the strength of the interacting domains in an interacting protein pair will be maximized automatically as the number of common neighbors and the similarity of the functions assigned to the individual proteins increase. However, sharing common neighbors may not always confirm the structural properties required of the interacting proteins (for suitable interaction of their domains). Similarly, different proteins may possess a large number of shared neighbors but little of the functional similarity required to validate a real-world interaction. Thus, it can be concluded that these three properties are mutually independent and hence need to be optimized simultaneously to validate the predicted PPIs. To accomplish this, the PPI problem is formulated in an evolutionary/swarm Multi-Objective Optimization (MOO) framework in which the objectives are jointly maximized.

Traditional evolutionary/swarm MOO algorithms commence from a population of randomly initialized candidate (or trial) solutions. In the present context, a candidate solution symbolizes the predicted PPI network. The (symmetric) PPI network of $N$ proteins is encoded by a vector of dimension $N \times (N-1)/2 + 1$. The first $N \times (N-1)/2$ components of the vector denote the interaction weights of all possible protein pairs.
The last component of the vector is the threshold value used to identify the predicted interacting protein pairs. Two proteins are predicted to be interacting if their interaction weight lies above the threshold value; otherwise they are predicted to be noninteracting. The relative merit of a candidate solution is assessed by its objective function values. In the present context, the quality of a candidate solution (i.e. the vector encoding the predicted PPI network) is estimated from its ability to maximize the aforementioned three objectives (functional similarity, domain–domain interaction, and shared common neighbors of interacting protein pairs). The randomly initialized candidate solutions are then evolved using evolutionary/swarm dynamics and participate in a competitive selection procedure for the next evolutionary generation. The better candidate solutions (those with better objective function values) are promoted to the next generation, and the cycle repeats until the termination criterion is satisfied. The best candidate solution obtained after convergence of the algorithm is then decoded to obtain the predicted PPI network. The predicted interacting and noninteracting protein pairs obtained from the best candidate solution are validated using standard PPI datasets, including BioGrid [29] (July 2013) and DIP [42].

In this paper, we study the scope of our proposed Firefly Algorithm with Nondominated Sorting (FANS) for solving the PPI prediction problem. It draws inspiration from the collective behavior and biochemical properties of fireflies. FANS is an evolutionary MOO strategy that combines the advantages of the Firefly Algorithm (FA) [12,40] with the mechanisms of Pareto-based ranking and crowding distance sorting [37,41]. This facilitates faster convergence to the Pareto optima and encourages a uniform spread of solutions. A novel approach is also recommended for improving the performance of FANS by modulating the step size for the random movement of each firefly according to its relative dominance capability in the population. The strategy aims to drive an inferior firefly by an explorative force while confining a good-quality firefly to its local neighborhood. In this paper, FANS is used to determine the interacting protein partners and thereby predict the PPI network.

This paper significantly improves on the work proposed in Ref. 23 on the following counts: (1) it maximizes the similarity in functions, the similarity in the domain interaction profile, and the number of common neighbors of proteins predicted to be interacting in the PPI network, and (2) the MOO algorithm FANS is used to solve the optimization problem. In Ref. 23, the authors used the Chaotic Local Search based Bat Algorithm (CLSBA), whereas the present version examines the scope of FANS. The inclusion of a new objective in the formulation and the replacement of the Bat Algorithm (BA) by the FANS algorithm result in a significant improvement in performance.
Experiments performed on the Saccharomyces cerevisiae (SC) dataset obtained from BioGrid and DIP, where negative samples are generated by randomly pairing proteins and removing the pairs already identified as interacting, demonstrate the superior performance of the proposed method. The proposed method attains high precision and recall and is able to predict a high number of interacting proteins on an unbalanced dataset.

The rest of the paper is divided into four sections. Section 2 formulates the PPI identification problem and explains the criteria used. In Sec. 3, the FANS algorithm is proposed. Section 4 presents the experimental settings and the results. Section 5 concludes the paper.

2. Formulation of PPI Identification Problem

This section formulates PPI prediction as an MOO problem. The characteristic features used for PPI prediction are reviewed to devise the objective functions, whose simultaneous optimization returns the desired network.

2.1. Formation of a PPI network

A PPI network (symmetric and undirected) of $N$ proteins may involve a maximum of $N \times (N-1)/2$ interactions, ignoring self-interaction. This observation has motivated us to represent the PPI network by a vector $Z$ of dimension $1 \times D$, where

$$D = \frac{N \times (N-1)}{2} + 1. \tag{1}$$

The $k$th element of $Z$, denoted by $Z_k \in [0, 1]$, for $k = [1, 2, \ldots, D-1]$, symbolizes the predicted weight of interaction $w_{i,j}$ between proteins $p_i$ and $p_j$, for $i = [1, 2, \ldots, N-1]$ and $j = [i+1, i+2, \ldots, N]$, such that

$$k = N \times (i-1) - \frac{i \times (i+1)}{2} + j. \tag{2}$$
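The index map in (2) and the role of the last vector element as the threshold can be checked with a minimal Python sketch. The helper names `pair_index` and `decode` are illustrative and not from the paper; the numerical values are those of the four-protein illustration in Fig. 1, and a pair is taken as interacting when its weight is at least Th.

```python
def pair_index(i: int, j: int, n: int) -> int:
    """1-based position k in Z of the weight w_{i,j}, per Eq. (2)."""
    if not (1 <= i < j <= n):
        raise ValueError("expected 1 <= i < j <= n")
    # i * (i + 1) is always even, so the integer division is exact.
    return n * (i - 1) - i * (i + 1) // 2 + j


def decode(z: list[float], n: int) -> tuple[dict[tuple[int, int], float], float]:
    """Split Z into the pairwise weights w_{i,j} and the threshold Th (last element)."""
    if len(z) != n * (n - 1) // 2 + 1:
        raise ValueError("Z must have N(N-1)/2 + 1 elements")
    weights = {
        (i, j): z[pair_index(i, j, n) - 1]  # -1 because Python lists are 0-based
        for i in range(1, n)
        for j in range(i + 1, n + 1)
    }
    return weights, z[-1]


# Values taken from the four-protein illustration (Fig. 1).
Z = [0.65, 0.4, 0.71, 0.19, 0.23, 0.61, 0.52]
weights, th = decode(Z, n=4)
interacting = sorted(pair for pair, w in weights.items() if w >= th)
print(pair_index(2, 3, 4))  # 4
print(interacting)          # [(1, 2), (1, 4), (3, 4)]
```

Storing only the upper triangle keeps the candidate vector at $N(N-1)/2 + 1$ entries, which is what the optimizer later perturbs.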

The proof of (2) is given in Appendix A. The $D$th element, $Z_D \in [0, 1]$, denotes the threshold value $Th$. The proteins $p_i$ and $p_j$ are predicted to be interacting if $w_{i,j} \geq Th$. Conversely, they are predicted to be noninteracting if their weight of interaction $w_{i,j}$, encoded by $Z_k$ (with $k$ identified using (2) for $k = [1, 2, \ldots, D-1]$, $i = [1, 2, \ldots, N-1]$ and $j = [i+1, i+2, \ldots, N]$), falls below the threshold $Th$.

Figure 1 exemplifies the vector representation of a PPI network consisting of $N = 4$ proteins. Here the vector $Z$ comprises $D = N \times (N-1)/2 + 1 = 7$ elements. The first six elements of $Z$ encode the weights of possible interactions between proteins, and the seventh element denotes the threshold $Th$. The weight of interaction $w_{i,j}$ between proteins $p_i$ and $p_j$ can be identified by decoding $Z$ using (2). For example, for proteins $p_2$ and $p_3$ (i.e. $i = 2$ and $j = 3$), we obtain $k = 4$ from (2). The weight of interaction $w_{2,3}$ is therefore decoded from the fourth component of $Z$, giving $w_{2,3} = Z_4 = 0.19$. The decoded vector is shown in Fig. 2. Once the weights have been decoded, they are individually compared with $Z_D = Th$ to identify the predicted noninteracting protein pairs. It is evident from Fig. 2 that proteins $p_2$ and $p_3$ are predicted to be noninteracting as $w_{2,3} (= 0.19) < Th (= 0.52)$, whereas proteins $p_1$ and $p_2$ are predicted to be interacting as $w_{1,2} (= 0.65) > Th (= 0.52)$. After decoding $Z$ and interpreting the interaction weights, we obtain the PPI network shown in Fig. 3.

| k     | 1    | 2   | 3    | 4    | 5    | 6    | 7    |
|-------|------|-----|------|------|------|------|------|
| Z_k   | 0.65 | 0.4 | 0.71 | 0.19 | 0.23 | 0.61 | 0.52 |

Fig. 1. Example of vector representation of a PPI network with four proteins.

| k               | 1     | 2     | 3     | 4     | 5     | 6     | 7    |
|-----------------|-------|-------|-------|-------|-------|-------|------|
| Z_k             | 0.65  | 0.4   | 0.71  | 0.19  | 0.23  | 0.61  | 0.52 |
| Decoded weight  | w_1,2 | w_1,3 | w_1,4 | w_2,3 | w_2,4 | w_3,4 | Th   |

Fig. 2. Example of decoding vector representing a PPI network with four proteins.

[Figure 3: network diagram with nodes p1, p2, p3, p4 and edges p1–p2 (0.65), p1–p4 (0.71), and p3–p4 (0.61).]

Fig. 3. Example of a PPI network with four proteins.

2.2. Predicting PPI using functional characteristic

The functional annotation of proteins is very important. Function assignment of proteins is proteome-wide and is determined by the global connectivity pattern of the protein network. Consequently, it is a challenging task to design a PPI network based on the functional assignments of the proteins. The assessment of the functional similarity of interacting proteins in Ref. 24 has motivated us to consider protein functions as an influential characteristic feature for PPI prediction. The development of function annotation schemes, such as GO, has made it possible to combine the semantic information of proteins with the protein functional category to predict the topological structural information of protein interactions more effectively.

More specifically, protein functions are annotated using GO terms. GO comprises three orthogonal ontologies: biological process, molecular function, and cellular component. Each of these ontologies can be represented as a directed acyclic graph in which nodes represent GO terms and edges represent their relationships. A GO term may have multiple parent (or ancestor) and child GO terms. The hierarchical structure of three GO terms, GO:0008150, GO:0006974, and GO:0006281, is given in Fig. 4 as an illustrative example. The ancestors of GO:0006281 are GO:0008150 and GO:0006974. There are two types of relationships: "is a" denotes that the child GO term is a subclass of the parent GO term, while "part-of" represents the child as a component of the parent. The "is a" relationship is used as the reference throughout the paper.

[Figure 4: GO:0008150 at the top, GO:0006974 below it, and GO:0006281 at the bottom, linked by is_a edges (e.g. GO:0006281 is_a GO:0006974).]

Fig. 4. Hierarchical representation of GO terms having the "is a" relationship.

Functionally similar proteins are expected to interact with high certainty. Proteins are functionally similar if they possess analogous molecular functions and are involved in similar biological processes, while being located in the same cellular compartment. The similarity between any two functions $f$ and $f'$ (both belonging to the same ontology, i.e. molecular function, biological process, or cellular component) can be evaluated by a similarity score between the respective GO terms annotating $f$ and $f'$. Let $GO(f)$ and $GO(f')$ signify the GO terms annotating protein functions $f$ and $f'$, respectively. Hence,

$$\mathrm{sim}(f, f') = \mathrm{sim}(GO(f), GO(f')). \tag{3}$$

There is evidence [22] that a semantic similarity measure between two proteins can be used to infer their possible interaction. The semantic similarity between two proteins is defined as the average similarity of all GO terms annotating the functions of the two proteins. Each protein pair receives three similarity values, one for each ontology, i.e. biological process, molecular function, and cellular component. Let $F_{bio}(p)$, $F_{mol}(p)$, and $F_{cell}(p)$ denote the sets of functions of protein $p$ annotated by GO terms belonging to these three respective ontology categories. The score of interaction possibility between any two proteins $p_i$ and $p_j$ based on their functional similarity is then given by

$$s_f(p_i, p_j) = \frac{1}{3} \times \left[ s_f^{bio}(p_i, p_j) + s_f^{mol}(p_i, p_j) + s_f^{cell}(p_i, p_j) \right], \tag{4}$$

where

$$
\begin{aligned}
s_f^{bio}(p_i, p_j) &= \max_{\forall f \in F_{bio}(p_i)} \left( \max_{\forall f' \in F_{bio}(p_j)} \mathrm{sim}(GO(f), GO(f')) \right), \\
s_f^{mol}(p_i, p_j) &= \max_{\forall f \in F_{mol}(p_i)} \left( \max_{\forall f' \in F_{mol}(p_j)} \mathrm{sim}(GO(f), GO(f')) \right), \\
s_f^{cell}(p_i, p_j) &= \max_{\forall f \in F_{cell}(p_i)} \left( \max_{\forall f' \in F_{cell}(p_j)} \mathrm{sim}(GO(f), GO(f')) \right).
\end{aligned}
\tag{5}
$$
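A minimal Python sketch of (4) and (5) follows. The function names, the `TOY_SIM` lookup table, and all similarity values in it are hypothetical placeholders for the GO-term similarity measure; returning 0 for an ontology in which a protein has no annotation is also an assumption, since the text does not cover that case.

```python
from typing import Callable, Dict, Set

Sim = Callable[[str, str], float]
ONTOLOGIES = ("bio", "mol", "cell")


def best_match(terms_i: Set[str], terms_j: Set[str], sim: Sim) -> float:
    """Eq. (5): highest similarity over all cross pairs of annotating GO terms.

    Returning 0.0 when either protein has no annotation is an assumption.
    """
    return max((sim(a, b) for a in terms_i for b in terms_j), default=0.0)


def functional_similarity(anno_i: Dict[str, Set[str]],
                          anno_j: Dict[str, Set[str]],
                          sim: Sim) -> float:
    """Eq. (4): mean of the three per-ontology scores."""
    scores = [best_match(anno_i.get(o, set()), anno_j.get(o, set()), sim)
              for o in ONTOLOGIES]
    return sum(scores) / 3.0


# Hypothetical similarity table standing in for the GO-term similarity;
# keys are unordered pairs of term labels.
TOY_SIM = {
    frozenset({"x"}): 0.9, frozenset({"x", "z"}): 0.2,
    frozenset({"x", "y"}): 0.3, frozenset({"y", "z"}): 0.1,
    frozenset({"m"}): 0.9, frozenset({"c1", "c2"}): 0.3,
}


def toy_sim(a: str, b: str) -> float:
    return TOY_SIM.get(frozenset({a, b}), 0.0)


p_i = {"bio": {"x", "y"}, "mol": {"m"}, "cell": {"c1"}}
p_j = {"bio": {"x", "z"}, "mol": {"m"}, "cell": {"c2"}}
sf_bio = best_match(p_i["bio"], p_j["bio"], toy_sim)
sf = functional_similarity(p_i, p_j, toy_sim)
print(sf_bio, round(sf, 3))
```

Because (5) takes a maximum rather than a mean within each ontology, one shared function is enough to give that ontology a high score.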

The formulations (4) and (5) ensure that proteins $p_i$ and $p_j$ are likely to interact with high certainty if they possess at least one common function in each of the three ontologies. For example, consider only the biological functionalities of proteins $p_i$ and $p_j$, with $F_{bio}(p_i) = \{x, y\}$ and $F_{bio}(p_j) = \{x, z\}$, respectively. Then, from (5), we obtain

$$
\begin{aligned}
s_f^{bio}(p_i, p_j) &= \max_{\forall f \in F_{bio}(p_i)} \left( \max_{\forall f' \in F_{bio}(p_j)} \mathrm{sim}(GO(f), GO(f')) \right) \\
&= \max \begin{pmatrix} \max(\mathrm{sim}(GO(x), GO(x)),\ \mathrm{sim}(GO(x), GO(z))) \\ \max(\mathrm{sim}(GO(y), GO(x)),\ \mathrm{sim}(GO(y), GO(z))) \end{pmatrix}.
\end{aligned}
\tag{6}
$$

The similarity score between two identical functions is the maximum, leading to $s_f^{bio}(p_i, p_j) = \mathrm{sim}(GO(x), GO(x))$, which is very high.


Thus a high degree of certainty of interaction between proteins $p_i$ and $p_j$ is ensured even though they have two dissimilar biological functions $y$ and $z$, respectively. The similarity score between $GO(f)$ and $GO(f')$ is adopted from Ref. 22 and is given as follows.

$$\mathrm{sim}(GO(f), GO(f')) = \max_{g \in CA(GO(f), GO(f'))} \left( \frac{2 \times \log p(g)}{\log p(GO(f)) + \log p(GO(f'))} \times (1 - p(g)) \right), \tag{7}$$


where $CA(GO(f), GO(f'))$ represents the set of all common ancestor GO terms of $GO(f)$ and $GO(f')$. Here $p(t)$ represents the probability of functional annotation of a GO term $t$, given by

$$p(t) = \frac{\mathrm{freq}(t)}{\mathrm{freq}(\mathrm{root})} \tag{8}$$

and

$$\mathrm{freq}(t) = \mathrm{anno}(t) + \sum_{c \in \mathrm{children}(t)} \mathrm{freq}(c), \tag{9}$$

where $\mathrm{anno}(t)$ represents the number of proteins (in the database) having functions annotated by the GO term $t$, $\mathrm{children}(t)$ denotes the set of children of $t$ obtained from the GO tree, and $\mathrm{freq}(\mathrm{root})$ represents the frequency of functional annotation of the root term in the GO tree. The more general (or specific) a function, the higher (or lower) its probability of annotation. Hence $p(\mathrm{root})$ is the maximum, as the root GO term is the most general term, annotating a large fraction of protein functions in the network. Thus the formulation of (7) ensures that the similarity score between $GO(f)$ and $GO(f')$ will be high

(i) if the commonality between $GO(f)$ and $GO(f')$, as captured by their common ancestors, is greater;
(ii) if the lowest common ancestor of $GO(f)$ and $GO(f')$ (i.e. $\arg\max_{g \in CA(GO(f), GO(f'))} (-\log p(g))$) is more specific, with a low value of $p(g)$ (i.e. a high value of $1 - p(g)$).

Therefore, maximization of the functional similarity between any two predicted interacting proteins $p_i$ and $p_j$ in the network, given as

$$J_1 = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \max\left( w_{i,j} \times s_f(p_i, p_j),\ (1 - w_{i,j}) \times (1 - s_f(p_i, p_j)) \right), \tag{10}$$

will yield a better and more accurate prediction of interacting protein partners. Expression (10) provides a high value of $J_1$ if $w_{i,j}$ (the predicted interaction weight between proteins $p_i$ and $p_j$) and $s_f(p_i, p_j)$ (their functional similarity) are comparable to each other. If $p_i$ and $p_j$ are predicted to be interacting (or noninteracting) with a high (or low) value of $w_{i,j}$ and they truly have a high (or low) functional similarity, then $w_{i,j} \times s_f(p_i, p_j)$ (or $(1 - w_{i,j}) \times (1 - s_f(p_i, p_j))$) is high, which in turn maximizes $J_1$. A wrong prediction of $w_{i,j}$ reduces the value of $J_1$. For example, if $s_f(p_i, p_j)$ is very low (or high), then predicting a high (or low) value of $w_{i,j}$ makes $J_1$ low.
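The similarity measure of (7) to (9) and the objective of (10) can be sketched in Python as follows. The toy GO tree, the annotation counts, and all function names are hypothetical. The sketch also assumes that the common-ancestor set includes the terms themselves and that two root terms score 0, because their log-probabilities are both zero; neither point is stated explicitly in the text.

```python
import math
from functools import lru_cache

# Hypothetical toy GO tree and annotation counts (not real GO data).
CHILDREN = {"root": ["a", "d"], "a": ["b", "c"], "b": [], "c": [], "d": []}
ANNO = {"root": 0, "a": 2, "b": 3, "c": 5, "d": 10}
PARENTS = {t: [p for p, kids in CHILDREN.items() if t in kids] for t in CHILDREN}


@lru_cache(maxsize=None)
def freq(t: str) -> int:
    """Eq. (9): own annotations plus the frequencies of all children."""
    return ANNO[t] + sum(freq(c) for c in CHILDREN[t])


def prob(t: str) -> float:
    """Eq. (8): annotation probability relative to the root."""
    return freq(t) / freq("root")


def ancestors(t: str) -> set:
    """Ancestors of t, including t itself (an assumption about CA)."""
    found = {t}
    for parent in PARENTS[t]:
        found |= ancestors(parent)
    return found


def sim(t1: str, t2: str) -> float:
    """Eq. (7): best common-ancestor score; 0.0 if both terms are the root."""
    denom = math.log(prob(t1)) + math.log(prob(t2))
    if denom == 0.0:
        return 0.0
    common = ancestors(t1) & ancestors(t2)
    best = max(2.0 * math.log(prob(g)) * (1.0 - prob(g)) / denom for g in common)
    return max(0.0, best)  # normalizes a possible -0.0


def agreement_objective(w: dict, s: dict) -> float:
    """Eq. (10): sum over protein pairs of max(w*s, (1-w)*(1-s))."""
    return sum(max(w[k] * s[k], (1 - w[k]) * (1 - s[k])) for k in w)


# Two protein pairs: one with identical terms, one sharing only the root.
scores = {(1, 2): sim("b", "b"), (1, 3): sim("b", "d")}
good = agreement_objective({(1, 2): 0.9, (1, 3): 0.1}, scores)
bad = agreement_objective({(1, 2): 0.1, (1, 3): 0.9}, scores)
print(round(sim("b", "c"), 3), good > bad)
```

In this toy setting, weights that agree with the similarity scores give a larger objective value than weights that contradict them, which is the behavior described for $J_1$.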


2.3. Domain–domain interaction characteristic

Domains are considered the structural and/or functional units of proteins and are conserved through various generations to represent protein functions or structures. Domain interaction provides a framework for PPI prediction models [25]. Here, we propose a quantitative measure to infer the PPI network that depends on the structural domains of proteins. The idea stems from the fact that proteins interact with each other through domains. Thus, in the domain-based approach, the strength of interaction between the domains of proteins is used to predict the interaction between the proteins. Let $D(p)$ be the set of domains present in protein $p$ and $DI(p)$ be the set of existing domains interacting with the domains of protein $p$. Note that $DI(p)$ also includes $D(p)$, indicating self-interaction between domains. Two proteins $p_i$ and $p_j$ can now be predicted to be possibly interacting

(i) if a large fraction of domains is common to their corresponding $DI$ sets;
(ii) if there is substantial biological and computational evidence supporting the interaction between any two domains $d \in DI(p_i)$ and $d' \in DI(p_j)$.

Considering these requirements, the strength of interaction between proteins $p_i$ and $p_j$ based on their domain interaction profile is given by

$$s_d(p_i, p_j) = \frac{|DI(p_i) \cap DI(p_j)|}{|DI(p_i) \cup DI(p_j)|} \left( \frac{1}{|DI(p_i)|} \sum_{d \in DI(p_i)} \left( \frac{1}{|DI(p_j)|} \sum_{d' \in DI(p_j)} \mathrm{strength}(d, d') \right) \right), \tag{11}$$

where

$$\mathrm{strength}(d, d') = \frac{m_{d,d'}}{M} \tag{12}$$

with $m_{d,d'}$ as the number of sources, out of a total of $M$ sources, that validate the interaction between domains $d$ and $d'$. Here $\mathrm{strength}(d, d')$ represents the strength of interaction between $d$ and $d'$ on the basis of the number of sources (evidence) substantiating their interaction. Hence a higher value of either $|DI(p_i) \cap DI(p_j)|$ or $m_{d,d'}$, or both, indicates a higher possibility of interaction between proteins $p_i$ and $p_j$.

Consider an example with two proteins $p_i$ and $p_j$ whose respective domain sets are $D(p_i) = \{d_1, d_2, d_4\}$ and $D(p_j) = \{d_2, d_4, d_6, d_8\}$. Suppose that domain $d_1$ interacts with $d_9$, $d_2$ interacts with $d_{10}$, $d_4$ interacts with $d_7$, and $d_6$ and $d_8$ interact with $d_9$. The sets of interacting domains for the respective proteins are then $DI(p_i) = \{d_1, d_2, d_4, d_7, d_9, d_{10}\}$ and $DI(p_j) = \{d_2, d_4, d_6, d_7, d_8, d_9, d_{10}\}$. The extent of support towards a positive interaction between $p_i$ and $p_j$ from the commonality of domains between their respective domain interaction profiles is given by

$$\frac{|DI(p_i) \cap DI(p_j)|}{|DI(p_i) \cup DI(p_j)|} = \frac{5}{8}. \tag{13}$$

Moreover, let the total number of sources used for authenticating domain interactions be $M = 10$ in this example. The total number of possible domain–domain interactions in the present context is $|DI(p_i)| \times |DI(p_j)| = 42$ for $|DI(p_i)| = 6$ and $|DI(p_j)| = 7$. Among these 42 possible interactions, consider the case of the domains $d_4 \in DI(p_i)$ and $d_7 \in DI(p_j)$. Let the interaction between $d_4$ and $d_7$ be supported by $m_{d_4,d_7} = 4$ sources out of $M = 10$ sources. Then the strength of interaction between these domains is given as

$$\mathrm{strength}(d_4, d_7) = \frac{m_{d_4,d_7}}{M} = \frac{4}{10}. \tag{14}$$
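The worked example can be reproduced with a short Python sketch of (11) and (12), using exact fractions. The names `EVIDENCE`, `strength`, and `domain_score` are illustrative. Only the evidence count for the pair ($d_4$, $d_7$) comes from the text; every other pair is assumed here to have no supporting source, and the evidence lookup is assumed to be symmetric in its two domains.

```python
from fractions import Fraction

M = 10  # total number of evidence sources in the worked example

# Evidence counts m_{d,d'}; only the (d4, d7) entry comes from the text,
# every other pair is assumed to have no supporting source.
EVIDENCE = {frozenset({"d4", "d7"}): 4}


def strength(d: str, e: str) -> Fraction:
    """Eq. (12): fraction of the M sources supporting the d-e interaction."""
    return Fraction(EVIDENCE.get(frozenset({d, e}), 0), M)


def domain_score(di_i: set, di_j: set) -> Fraction:
    """Eq. (11): Jaccard overlap of the DI sets times the mean pairwise strength."""
    if not di_i or not di_j:
        return Fraction(0)  # assumption: no domain information means no support
    overlap = Fraction(len(di_i & di_j), len(di_i | di_j))
    total = sum((strength(d, e) for d in di_i for e in di_j), Fraction(0))
    return overlap * total / (len(di_i) * len(di_j))


DI_i = {"d1", "d2", "d4", "d7", "d9", "d10"}
DI_j = {"d2", "d4", "d6", "d7", "d8", "d9", "d10"}
print(Fraction(len(DI_i & DI_j), len(DI_i | DI_j)))  # 5/8, as in Eq. (13)
print(strength("d4", "d7"))                          # 2/5, i.e. 4/10 as in Eq. (14)
print(domain_score(DI_i, DI_j))                      # 1/84 under the assumptions above
```

Because both $d_4$ and $d_7$ occur in both $DI$ sets, the symmetric lookup counts the pair twice among the 42 combinations, which gives $(5/8) \times (8/10)/42 = 1/84$.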

The same measure is evaluated for all remaining 41 possible domain–domain interactions of proteins $p_i$ and $p_j$. Therefore, the maximization of the similarity between two predicted interacting proteins based on their interacting-domain sets, given as

$$J_2 = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \max\left( w_{i,j} \times s_d(p_i, p_j),\ (1 - w_{i,j}) \times (1 - s_d(p_i, p_j)) \right), \tag{15}$$

is expected to capture the predicted interacting protein pairs. As in the case of $J_1$, predicting high (or low) interaction weights $w_{i,j}$ for proteins $p_i$ and $p_j$ with strong (or weak) evidence of domain–domain interaction ensures a high value of $J_2$.

2.4. Predicting PPI using neighborhood topology

A vast literature [26,37] indicates that the possibility of interaction between a pair of proteins is proportional to the size of their common neighborhood. The size of the common neighborhood of a protein pair $p_i$ and $p_j$ can be determined by identifying the number of proteins $p_k$ in the network that directly interact with both $p_i$ and $p_j$. Protein $p_k$ is predicted to be interacting with $p_i$ and $p_j$ if the corresponding weights of interaction $w_{i,k}$ and $w_{j,k}$ are both greater than the threshold value $Th$. The weight of interaction $w_{i,j}$ between proteins $p_i$ and $p_j$ is thus reflected by the number of proteins in their common neighborhood. Let the common neighborhood of $p_i$ and $p_j$ be represented by $n_{i,j}$. Hence

$$w_{i,j} \propto |n_{i,j}| / N \tag{16.1}$$

and

$$n_{i,j} \text{ is the set of proteins } p_k \text{ in the network with } w_{i,k} \text{ (or } w_{k,i}) > Th \text{ and } w_{j,k} \text{ (or } w_{k,j}) > Th. \tag{16.2}$$

Consequently, the accuracy in predicting the interacting protein pairs $p_i$ and $p_j$ in a PPI network can be assessed by measuring the similarity between their predicted weight of interaction $w_{i,j}$ and their common neighborhood ratio $|n_{i,j}|/N$. This requirement can be met by maximizing (17).

$$s_n(p_i, p_j) = \frac{1}{\left| w_{i,j} - |n_{i,j}|/N \right| + \varepsilon}. \tag{17}$$

Here $\varepsilon$ is an arbitrarily small positive constant that prevents division by zero. Therefore, maximization of (18) may ensure the appropriateness of the predicted interaction weights between proteins in a network.

$$J_3 = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} s_n(p_i, p_j). \tag{18}$$
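The neighborhood objective of (16.2) to (18) can be sketched in Python as below, reusing the decoded weights of the four-protein illustration. The function names are illustrative, and the value `eps = 0.01` is a hypothetical choice, since the text only requires a small positive constant.

```python
def weight(w: dict, a: int, b: int) -> float:
    """Symmetric lookup: w is keyed by (i, j) with i < j."""
    return w[(a, b)] if a < b else w[(b, a)]


def common_neighbors(w: dict, i: int, j: int, n: int, th: float) -> set:
    """Eq. (16.2): proteins k linked to both p_i and p_j with weight above Th."""
    return {k for k in range(1, n + 1)
            if k not in (i, j) and weight(w, i, k) > th and weight(w, j, k) > th}


def j3(w: dict, n: int, th: float, eps: float = 0.01) -> float:
    """Eqs. (17)-(18): sum of 1 / (|w_ij - |n_ij|/N| + eps) over all pairs."""
    total = 0.0
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            ratio = len(common_neighbors(w, i, j, n, th)) / n
            total += 1.0 / (abs(w[(i, j)] - ratio) + eps)
    return total


# Decoded weights and threshold of the four-protein illustration.
W = {(1, 2): 0.65, (1, 3): 0.4, (1, 4): 0.71,
     (2, 3): 0.19, (2, 4): 0.23, (3, 4): 0.61}
print(common_neighbors(W, 1, 3, n=4, th=0.52))  # {4}
print(common_neighbors(W, 2, 4, n=4, th=0.52))  # {1}
print(j3(W, n=4, th=0.52))
```

Note that the reciprocal form of (17) makes the objective very sensitive to pairs whose weight almost equals the neighborhood ratio, so the choice of $\varepsilon$ bounds the largest contribution a single pair can make.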

The above discussion has motivated us to recast the PPI prediction problem in an MOO framework for the simultaneous maximization of three objectives:

(i) $J_1$, for assuring functional similarity between interacting protein pairs;
(ii) $J_2$, for ensuring domain–domain interaction between protein pairs;
(iii) $J_3$, for affirming the proportional relationship between the shared common neighbors of interacting protein pairs and their predicted interaction weight.

3. Firefly Algorithm with Nondominated Sorting (FANS)

In the Firefly Algorithm [12] with Nondominated Sorting [27] (FANS), the position of a firefly represents a possible solution of the optimization problem, and the light intensity at the position of the firefly corresponds to the fitness of the associated solution. An overview of the main steps of the FANS algorithm for jointly maximizing all $L$ objectives is presented next.

(I) Initialization: FANS involves a population $P_G$ of $NP$ $D$-dimensional firefly positions $Z_i(G) = \{z_{i,1}(G), z_{i,2}(G), \ldots, z_{i,D}(G)\}$, which at the current generation $G = 0$ are randomly initialized in the range $[Z^{min}, Z^{max}]$, where $Z^{min} = \{z_1^{min}, z_2^{min}, \ldots, z_D^{min}\}$ and $Z^{max} = \{z_1^{max}, z_2^{max}, \ldots, z_D^{max}\}$. The $j$th element of the $i$th firefly position is thus initialized by

$$z_{i,j}(0) = z_j^{min} + \mathrm{rand}(0, 1) \times (z_j^{max} - z_j^{min}) \tag{19}$$

for $j = [1, D]$. Here $\mathrm{rand}(0, 1)$ is a uniformly distributed random number in the range (0, 1). The $k$th objective function $J_k(Z_i(0))$ of $Z_i(0)$ is evaluated for $i = [1, NP]$ and $k = [1, L]$.
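A minimal Python sketch of the initialization in (19) is shown below. The function name `init_population`, the population size of 5, and the fixed seed are illustrative assumptions. For the PPI encoding, each of the $D$ components is taken to lie in [0, 1].

```python
import random


def init_population(num_fireflies, z_min, z_max, rng):
    """Eq. (19): z_ij(0) = z_j^min + rand(0, 1) * (z_j^max - z_j^min)."""
    return [[lo + rng.random() * (hi - lo) for lo, hi in zip(z_min, z_max)]
            for _ in range(num_fireflies)]


N = 4                        # proteins, as in the four-protein illustration
D = N * (N - 1) // 2 + 1     # Eq. (1): six pair weights plus the threshold
rng = random.Random(0)       # fixed seed only for reproducibility
population = init_population(5, [0.0] * D, [1.0] * D, rng)
print(len(population), len(population[0]))  # 5 7
```

Python's `random()` samples from [0, 1) rather than the open interval (0, 1); for this purpose the difference is immaterial.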


(II) Identification of Dominating Sets: Corresponding to each firefly $Z_i(G)$, two sets of candidates are identified from the current generation population $P_G$. The first set, denoted by $Set1_i(G)$, comprises the position vectors of the fireflies dominating $Z_i(G)$. More specifically, a member $Z_j(G) \in Set1_i(G)$, for $j = [1, NP]$ but $i \neq j$, should satisfy the following conditions.

(a) $J_k(Z_j(G)) \geq J_k(Z_i(G))$ for $k = [1, L]$.
(b) $J_l(Z_j(G)) > J_l(Z_i(G))$ for at least one $l \in [1, L]$.

Similarly, the second set, $Set2_i(G)$, is constructed by including the position vectors of the fireflies that are dominated by $Z_i(G)$. This categorization procedure is repeated for all fireflies with $i = [1, NP]$.

(III) Attraction to Brighter Fireflies: The firefly $Z_i(G)$ is now attracted towards the positions of the brighter dominating fireflies $Z_j(G) \in Set1_i(G)$, i.e. those for which $Z_j(G)$ dominates $Z_i(G)$. The attractiveness $\beta_{i,j}$ of $Z_i(G)$ towards $Z_j(G)$ is proportional to the light intensity seen by adjacent fireflies. However, the attractiveness $\beta_{i,j}$ decreases exponentially with the distance between them, denoted by $r_{i,j}$, as given in (20).

$$\beta_{i,j} = \beta_0 \exp(-\gamma r_{i,j}^{m}), \quad m \geq 1, \tag{20}$$

where $\beta_0$ denotes the maximum attractiveness experienced by the $i$th firefly at its own position (i.e. at $r_{i,j} = r_{i,i} = 0$) and $\gamma$ is the light absorption coefficient, which controls the variation of $\beta_{i,j}$ with $r_{i,j}$. This parameter is responsible for the convergence speed of FA [28]. A setting of $\gamma = 0$ leads to constant attractiveness, while $\gamma$ approaching infinity is equivalent to a completely random search [28]. In (20), $m$ is a positive constant representing a nonlinear modulation index. The distance between $Z_i(G)$ and $Z_j(G)$ is computed using the Euclidean norm as follows.

$$r_{i,j} = \| Z_i(G) - Z_j(G) \|. \tag{21}$$
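Steps (II) and (III) can be sketched in Python as follows. The function names are illustrative, and the default values of `beta0`, `gamma`, and `m` are placeholders rather than settings taken from the paper. Objective vectors are compared under maximization, following conditions (a) and (b).

```python
import math


def dominates(ja, jb) -> bool:
    """True if objective vector ja dominates jb: conditions (a) and (b)."""
    return (all(a >= b for a, b in zip(ja, jb))
            and any(a > b for a, b in zip(ja, jb)))


def dominance_sets(objs):
    """For each firefly i, return Set1_i (dominating i) and Set2_i (dominated by i)."""
    n = len(objs)
    set1 = [[j for j in range(n) if j != i and dominates(objs[j], objs[i])]
            for i in range(n)]
    set2 = [[j for j in range(n) if j != i and dominates(objs[i], objs[j])]
            for i in range(n)]
    return set1, set2


def attractiveness(z_i, z_j, beta0=1.0, gamma=1.0, m=2.0) -> float:
    """Eq. (20), with the Euclidean distance of Eq. (21)."""
    r = math.dist(z_i, z_j)
    return beta0 * math.exp(-gamma * r ** m)


# Hypothetical objective vectors (J1, J2, J3) for four fireflies, indexed from 0.
OBJS = [(1, 1, 1), (2, 2, 2), (2, 1, 3), (0, 0, 0)]
set1, set2 = dominance_sets(OBJS)
print(set1)  # [[1, 2], [], [], [0, 1, 2]]
print(set2)  # [[3], [0, 3], [0, 3], []]
print(attractiveness([0.1, 0.2], [0.1, 0.2]))  # 1.0 at zero distance
```

Fireflies 1 and 2 in this toy population do not dominate each other, so both have an empty `Set1` and would be attracted by no other firefly.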

This step is repeated for $i = [1, NP]$.

(IV) Movement of Fireflies: The firefly at position $Z_i(G)$ first stores its current position in its memory, symbolized by $Z_i^{cur}(G)$, and then moves towards a more attractive position $Z_j(G) \in Set1_i(G)$ occupied by a brighter firefly $j$, following the dynamics given in (22).

$$Z_i^{next}(G) = Z_i^{cur}(G) + \beta_{i,j} \times (Z_j(G) - Z_i^{cur}(G)) + \alpha \times (r - 0.5), \tag{22.1}$$

$$Z_i^{cur}(G) \leftarrow Z_i^{next}(G). \tag{22.2}$$

The movement of the ith ¯re°y, governed by (22), is continued for j ¼ ½1; jSet 1i ðGÞj. The ¯rst term in the position updating formula (22.1) represents the ¯re°y's position after the last movement. The second term in (22.1) denotes the change in the position of the ¯re°y at Zi ðGÞ due to the attraction towards the brighter ¯re°y at Zj ðGÞ 2 Set 1i ðGÞ. Hence it is apparent that the brightest ¯re°y with no more attractive ¯re°y in the current population PG will have no motion due 1650008-12

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY on 02/05/16. For personal use only.

Prediction of PPI network using a MOO approach

to the second term and may get stuck at the local optima. To circumvent the problem, the last term is introduced in (22.1) for the random movement of the ¯re°ies with a maximum stepsize of  2 ð0; 1Þ. Here r is a D-dimensional vector with its jth component rj being a random number uniformly distributed in the range (0, 1). After completion of its journey mediated by the brighter ones, the updated position of the ith ¯re°y is represented by Z next ðGÞ for i ¼ ½1; NP. It is to be noted i that the ith ¯re°y memorizes both of its positions before starting its motion, i.e. Zi ðGÞ, and after completing its journey, i.e. Z next ðGÞ. This step is repeated for i i ¼ ½1; NP. It is noteworthy that the random movement of a ¯re°y with step size  in (22.1) in traditional FA12 helps the population individuals to avoid local optima by their expedition pro¯ciency. Particularly, the convergence of ¯re°ies towards global optima greatly relies on the step size () pro¯le. However in traditional FA,  is taken to be constant for all ¯re°ies in the current population, irrespective of their ¯tness. Consequently, ¯re°ies in vicinity of the global optima may be deviated away (with  value greater than the requirement) and may get trapped at local optima. Contrarily, ¯re°ies far away from the global optima in the ¯tness landscapes (with  smaller than necessity), may not be given any opportunity to be attracted towards the global optimum. To overcome this problem, , used for random movement of a ¯re°y, needs to be modulated with its relative merit over other members of the population PG . It is realized here by setting i ¼ 1  jSet 2i ðGÞj=NP

for i ¼ ½1; NP:

ð23Þ
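The update rules (22) and (23) can be sketched as follows. This is only a minimal illustration: the function names and default parameter values are hypothetical, the attractiveness is computed from the stored position $Z_i(G)$ as in (21), and applying a single purely random step when Set1 is empty is an assumption of this sketch rather than something the text prescribes.

```python
import math
import random


def step_size(n_dominated, pop_size):
    """Eq. (23): alpha_i = 1 - |Set2_i(G)| / NP."""
    return 1.0 - n_dominated / pop_size


def move_firefly(z_i, brighter, alpha, beta0=1.0, gamma=1.0, m=2, rng=random):
    """Move z_i towards every position in `brighter` (the members of Set1_i)."""
    z_cur = list(z_i)                                   # Z_i^cur(G)
    if not brighter:
        # Assumption of this sketch: a firefly with no dominator takes one random step.
        return [c + alpha * (rng.random() - 0.5) for c in z_cur]
    for z_j in brighter:
        # Attractiveness from Eqs. (20)-(21), measured from the stored position Z_i(G).
        beta = beta0 * math.exp(-gamma * math.dist(z_i, z_j) ** m)
        # Eq. (22.1), followed by the assignment of Eq. (22.2).
        z_cur = [c + beta * (t - c) + alpha * (rng.random() - 0.5)
                 for c, t in zip(z_cur, z_j)]
    return z_cur                                        # Z_i^next(G)
```

With this rule, a firefly that dominates half of a population of 10 gets `step_size(5, 10) == 0.5`, whereas one that dominates nobody keeps the full step size of 1.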

It is evident from (23) that the greater (or smaller) the size $|\mathrm{Set2}_i(G)|$ of the set of members of $P_G$ dominated by $Z_i(G)$, the smaller (or larger) is its corresponding step size. This in turn ensures that good-quality fireflies (which dominate a large fraction of the population) search their local neighborhood with a small step size, so as not to miss the global optimum, whereas a poorly performing member (which is most frequently dominated by its competitors) participates in the global search to explore promising regions. The $k$th objective function $J_k(Z_i^{next}(G))$ of $Z_i^{next}(G)$ is evaluated for $i = [1, NP]$ and $k = [1, L]$.

(V) Selection: If $Z_i^{next}(G)$ dominates $Z_i(G)$, $Z_i^{next}(G)$ replaces $Z_i(G)$ in the memory of the $i$th firefly. However, if $Z_i(G)$ and $Z_i^{next}(G)$ are nondominated, both positions are kept in the population $P_G$. This step is reiterated for $i = [1, NP]$, and hence a population of firefly positions is obtained with size $|P_G| \in [NP, 2NP]$.

(VI) Nondominated Sorting: The population $P_G$ thus obtained is sorted into a number of Pareto fronts following the nondomination principle. All the nondominated firefly positions of the current population are ranked one and are included in the optimal Pareto front, Front_Set(1). The second front, Front_Set(2), is formed by the nondominated firefly positions of the set $\{P_G - \mathrm{Front\_Set}(1)\}$. Continuing this Pareto ranking process eventually identifies all the nondominated sets and ranks them as Front_Set(1), Front_Set(2), Front_Set(3), and so on.
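Steps (V) and (VI) can be illustrated with the short sketch below, again assuming that all objectives are maximized. The helper names are hypothetical, the sorting is a straightforward repeated-filtering version rather than an optimized one, and keeping the old position when it dominates the new one is an assumption, since the text does not spell out that case.

```python
def dominates(fa, fb):
    """True if objective vector fa dominates fb (all objectives maximized)."""
    return (all(a >= b for a, b in zip(fa, fb))
            and any(a > b for a, b in zip(fa, fb)))


def select(z_old, f_old, z_new, f_new):
    """Step (V): return the (position, fitness) pairs kept for one firefly."""
    if dominates(f_new, f_old):
        return [(z_new, f_new)]
    if dominates(f_old, f_new):
        return [(z_old, f_old)]          # assumption: the old position survives
    return [(z_old, f_old), (z_new, f_new)]


def nondominated_sort(fitness):
    """Step (VI): split indices into Front_Set(1), Front_Set(2), ..."""
    remaining = set(range(len(fitness)))
    fronts = []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(fitness[j], fitness[i])
                                  for j in remaining if j != i))
        fronts.append(front)
        remaining -= set(front)
    return fronts
```

For the objective vectors `[(1, 1), (2, 2), (3, 1), (1, 3), (0, 0)]`, the first front contains indices 1, 2, and 3, the second front contains index 0, and the third contains index 4.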

(VII) Truncation of Extended Population: The nondominated sets of firefly positions are filtered from $P_G$ (of size $NP \le |P_G| \le 2NP$) to pass on to the next generation population $P_{G+1}$ (of size $NP$) in ascending order of their Pareto ranking. Let Front_Set(l) be the set such that adding Front_Set(l) to $P_{G+1}$ makes $|P_{G+1}|$ exceed $NP$. Then the firefly positions in Front_Set(l) are sorted in descending order of crowding distance CD,27 which measures the perimeter of the hypercube formed by their nearest neighbors at the vertices in the fitness landscape. To ensure diversity in the population, the firefly positions in Front_Set(l) with the highest crowding distances are prioritized for inclusion in $P_{G+1}$ until $|P_{G+1}|$ becomes $NP$.

After each evolution, we repeat from step (II) until the termination condition for convergence is satisfied. The pseudo-code for the proposed FANS algorithm with $L$ objectives is given below.

Procedure FANS
Begin
  1. Initialize a population $P_G$ of $NP$ $D$-dimensional firefly position vectors $Z_i(G)$ at generation $G = 0$ using (19) for $i = [1, NP]$.
  2. Evaluate $J_k(Z_i(G))$ for $i = [1, NP]$ and $k = [1, L]$.
  3. While termination condition is not reached do
     Begin
       3.1. Identify the sets $\mathrm{Set1}_i(G)$ and $\mathrm{Set2}_i(G)$ corresponding to $Z_i(G)$ for $i = [1, NP]$ following the principle given in Sec. 3(II).
       3.2. Store $Z_i(G)$ in $Z_i^{cur}(G)$ and perform its movement towards all $Z_j(G) \in \mathrm{Set1}_i(G)$ to generate a new position $Z_i^{next}(G)$ following the dynamics in (22) and (23) for $i = [1, NP]$.
       3.3. Evaluate $J_k(Z_i^{next}(G))$ for $i = [1, NP]$ and $k = [1, L]$.
       3.4. If $Z_i^{next}(G)$ dominates $Z_i(G)$
              Then replace $Z_i(G)$ with $Z_i^{next}(G)$;
            Else If $Z_i^{next}(G)$ and $Z_i(G)$ are nondominated
              Then $P_G \leftarrow P_G \cup \{Z_i^{next}(G)\}$;
            End If
            Repeat the step for $i = [1, NP]$.
       3.5. Sort $P_G$ into subsequent Pareto fronts Front_Set using the nondominated sorting principle.
       3.6. Include the firefly positions from the Pareto fronts Front_Set of $P_G$ into $P_{G+1}$, starting from Front_Set(1), until Front_Set(l) is found such that $|P_{G+1}| + |\mathrm{Front\_Set}(l)| > NP$. Sort the position vectors in Front_Set(l) in descending order of crowding distance and set $P_{G+1} \leftarrow P_{G+1} \cup$ top $(NP - |P_{G+1}|)$ firefly position vectors of Front_Set(l).
       3.7. $G \leftarrow G + 1$.
     End While.
End.
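The truncation in step 3.6 can be sketched as below. The crowding distance used here is the familiar per-objective normalized neighbor gap with infinite distance at the boundary points; whether the paper's CD uses exactly this normalization is an assumption, and the function names are hypothetical.

```python
def crowding_distance(front_fitness):
    """Crowding distance of each member of one front (larger = less crowded)."""
    n = len(front_fitness)
    dist = [0.0] * n
    if n == 0:
        return dist
    for k in range(len(front_fitness[0])):
        order = sorted(range(n), key=lambda i: front_fitness[i][k])
        lo, hi = front_fitness[order[0]][k], front_fitness[order[-1]][k]
        dist[order[0]] = dist[order[-1]] = float("inf")   # keep boundary points
        if hi == lo:
            continue
        for pos in range(1, n - 1):
            gap = front_fitness[order[pos + 1]][k] - front_fitness[order[pos - 1]][k]
            dist[order[pos]] += gap / (hi - lo)
    return dist


def truncate(fronts, fitness, pop_size):
    """Fill the next population front by front; cut the overflowing front by CD."""
    next_pop = []
    for front in fronts:
        if len(next_pop) + len(front) <= pop_size:
            next_pop.extend(front)
            continue
        cd = crowding_distance([fitness[i] for i in front])
        ranked = sorted(range(len(front)), key=lambda p: cd[p], reverse=True)
        next_pop.extend(front[p] for p in ranked[:pop_size - len(next_pop)])
        break
    return next_pop
```

Here `fronts` is a list of index lists such as the one produced by a nondominated sort, and `fitness[i]` is the objective vector of the firefly position with index `i`.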

4. Experiments and Results

4.1. Database used

In our simulation, protein interaction data of SC are acquired from two databases, BioGrid29 (July 2013) and DIP.42 We selected SC as the test organism because more information is available about its PPIs than about those of any other organism. To analyze the efficiency of the proposed algorithm in predicting PPIs and to validate the predictions, the predicted protein interactions are compared with the protein interaction data of BioGrid and DIP. The BioGrid database consists of 6,391 proteins and 326,967 interactions. The DIP database consists of 5,173 proteins and 24,627 interactions. Since the proposed method of predicting the PPI network requires no training dataset, it is not biased by the selection of noninteracting protein pairs, which are not readily provided by the BioGrid or DIP databases. However, standard noninteracting (or negative) datasets play an important role in validating the proposed method. Two methods are generally used to create a negative dataset. The first is to select as noninteracting those protein pairs whose cellular localization annotations differ. The second is to pair proteins randomly and remove those pairs that are already identified as positive. The bias introduced by using non-colocalized protein pairs as the negative set is pointed out by Ben-Hur and Noble in Ref. 43. Thus, in our work we have used the second method to generate the noninteracting dataset. The Cartesian coordinates of the proteins in SC are acquired from the Protein Data Bank.30 The GO terms of each protein for evaluating functional similarity are obtained from the Saccharomyces Genome Database.31 The total numbers of GO terms belonging to the categories of biological process, cellular component, and molecular function are 27,273, 3,737, and 9,901, respectively. The maximum depth of the GO tree is 16 for biological process, 13 for cellular component, and 15 for molecular function.
The protein domain information is gathered from Pfam,32 a database of protein domain families that contains multiple sequence alignments of common domain families. In total, there are 4,293 Pfam domains defined by the set of proteins in SC. The domain–domain interactions are obtained from the DOMINE39 database. According to the DOMINE database, the total number of sources $M$ in (12) is set to 10 (including iPfam, 3did, ME, RCDP, p-value, Fusion, LP, DPEA, RDFF, and DIMA).

4.2. Comparative framework and parameter settings

We have compared our proposed method with other computational PPI prediction methods, including the Relative Specific Similarity (RSS) method using GO terms,34 the Random Decision Forest (RDF) method with a domain-based approach,35 the Domain Cohesion Coupling (DCC) method with a multi-domain collaboration approach,36 Bagging with REP Tree (REP Tree),37 and CLSBA.23 In the RSS method, Z-score values ranging from 0.1 to 0.9 are used to determine the statistical significance of assigning protein pairs to categories with different RSS values. In the RDF method, the three stopping criteria that limit the tree size are the node impurity threshold, which is 0.01, the minimum number of samples at a node, which is 3, and the maximum tree level, which is 450. In the DCC method, the balancing parameter was set to 1.0 and the Interaction Probability (IP) threshold was varied from 0.0 to 1.0. In the bagging with REP Tree approach, the link structure of the PPI graph is combined with information about the proteins to predict protein interactions. Here a classifier is built by resampling and averaging the results across numerous iterations. In the CLSBA-based approach, the prediction of PPIs is driven by significant protein characteristics, namely the phylogenetic profile, accessible solvent area, and domain interaction profiles. In this work,23 chaotic behavior following the nonlinear dynamics of the logistic map is used to enhance the exploration capability of the traditional BA. We have also compared the proposed FANS-based PPI prediction approach with a Multi-Objective Particle Swarm Optimization (MOPSO)38-based PPI prediction technique. In MOPSO, $W$ (the inertia weight) was set to 0.4 and the random numbers $R_1$ and $R_2$ are taken in the range [0, 1]. The same objectives given in (10), (15), and (18) are considered for the MOPSO-based simulations of the PPI prediction problem. The best parameter settings for all competitor algorithms are employed in this paper, as stated in their respective sources.

4.3. Performance metrics

The PPI network obtained by the proposed method is compared with the standard PPI network obtained through experiments. Four different classes of interaction can be observed, namely True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). Based on these four classes, we have considered the well-known metrics listed in Table 1 to compare the relative performance of our proposed PPI prediction algorithm with that of its competitors. The higher the values of all the metrics except NLR, the better the performance of a PPI prediction algorithm. The smaller the NLR, the lower the likelihood of FN.

Table 1. Performance metrics for PPI prediction.

| Performance metric | Expression |
|---|---|
| Sensitivity (Recall) | TP / (TP + FN) |
| Specificity | TN / (FP + TN) |
| Positive Likelihood Ratio (PLR) | Sensitivity / (1 − Specificity) |
| Negative Likelihood Ratio (NLR) | (1 − Sensitivity) / Specificity |
| Precision / Positive Predicted Value (PPV) | TP / (TP + FP) |
| Negative Predicted Value (NPV) | TN / (TN + FN) |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) |
| F1 score | 2TP / (2TP + FP + FN) |
| Matthews Correlation Coefficient (MCC) | ((TP × TN) − (FP × FN)) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)) |
| Receiver Operating Curve (ROC) | Plot of Sensitivity against (1 − Specificity) |
| Area Under Curve (AUC) | Area under the ROC curve |
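The scalar metrics of Table 1 follow directly from the four confusion counts. The sketch below is a minimal illustration; the function name `ppi_metrics`, the dictionary keys, and the convention of returning infinity for PLR when Specificity equals 1 are choices made for this example only, and all other denominators are assumed to be nonzero.

```python
import math


def ppi_metrics(tp, tn, fp, fn):
    """Compute the scalar metrics of Table 1 from TP, TN, FP, and FN counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Assumption: report an infinite PLR when there are no false positives.
        "plr": sensitivity / (1 - specificity) if specificity < 1 else float("inf"),
        "nlr": (1 - sensitivity) / specificity,
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }
```

For instance, `ppi_metrics(tp=90, tn=80, fp=20, fn=10)` gives a sensitivity of 0.9, a specificity of 0.8, and an accuracy of 0.85.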

4.4. Results and performance analysis

According to the discussion in Sec. 2, the PPI prediction problem now boils down to an MOO problem: optimally determining the existence of a plausible interaction between any two proteins by jointly satisfying the three individual objectives given in (10), (15), and (18). The optimized PPI network can be obtained by decoding the best firefly position from the approximate Pareto front $A$ (Front_Set(1) of FANS). It is, however, noteworthy that all position vectors in $A$ are equally good. To select the best one among the many possible candidates, the following composite measure is considered for each firefly position vector $Z_i \in A$.

$$J(Z_i) = \bar{J}_1(Z_i) \times \bar{J}_2(Z_i) \times \bar{J}_3(Z_i) \quad \text{for } i = [1, |A|], \tag{24}$$

where $|A|$ is the number of nondominated solutions in $A$ and

$$\bar{J}_k(Z_i) = J_k(Z_i) \Big/ \sum_{l=1}^{|A|} J_k(Z_l) \tag{25}$$

represents the normalized estimate of $J_k(Z_i)$, lying in (0, 1), for $k = [1, 3]$. The effective nondominated firefly position $Z \in A$ having the highest $J(Z_i)$ for $i = [1, |A|]$ is then identified for decoding the optimal PPI network obtained by FANS. The code for FANS and the PPI network obtained by applying FANS to the BioGrid (July 2013) dataset are provided in a supplementary file uploaded at the website `http://www.4shared.com/folder/bIEvMUOl/online.html'.

The effect of the individual objectives, as well as of their combinations, on the prediction of PPIs for the two PPI databases under consideration (BioGrid and DIP) is evaluated and presented in Tables 3 and 4, respectively. The efficacy of the individual objectives in predicting PPIs is studied using a single-objective evolutionary/swarm algorithm; FA is selected for this purpose. Similarly, the potential of all possible combinations of objectives for PPI prediction is studied using the multi-objective FANS algorithm. This results in seven variants of evolutionary/swarm PPI prediction algorithms, as given in Table 2. The mean and standard deviation of the best-of-run values of the performance metrics over 25 independent runs of each PPI prediction algorithm listed in Table 2 are reported in Tables 3 and 4 for the two databases, respectively. The standard deviation of a metric value is given in parentheses below its respective mean value. To compare the means of each performance metric obtained by the best and the second best algorithm, a paired two-tailed t-test has been used. The statistical significance of the difference between the means of the two best algorithms is provided in the last rows of Tables 3 and 4. Note that here "+" indicates that the t-value with 24 degrees of freedom is significant at the 0.05 level of significance, whereas "−" means that the difference of means is not statistically significant, and "NA" stands for not applicable, covering cases in which two or more algorithms achieve the best metric value. The best metric value achieved by an algorithm is marked in bold.

Table 2. Competitive evolutionary/swarm PPI prediction algorithms for comparing individual objectives and their combinations.

| PPI prediction algorithm | Evolutionary/swarm algorithm | Objective(s) |
|---|---|---|
| FANS | Multi-objective FANS | Functional similarity, domain interaction profile, and neighborhood topology |
| FANS_Fun_Dom | Multi-objective FANS | Functional similarity and domain interaction profile |
| FANS_Fun_Neigh | Multi-objective FANS | Functional similarity and neighborhood topology |
| FANS_Dom_Neigh | Multi-objective FANS | Domain interaction profile and neighborhood topology |
| FA_Fun | Single objective FA | Functional similarity |
| FA_Dom | Single objective FA | Domain interaction profile |
| FA_Neigh | Single objective FA | Neighborhood topology |

It can be observed from Tables 3 and 4 that the FANS-based realization of PPI prediction, which exploits the composite benefit of maximizing all three objectives (based on protein function, domain interaction profile, and neighborhood topology), attains the best rank among all seven contenders for predicting PPIs on both databases under consideration.

The next experiment compares our proposed PPI prediction algorithm with the existing state-of-the-art techniques listed in the comparative framework. The Receiver Operating Curves (ROCs) of the different PPI prediction algorithms for interactions obtained from the BioGrid database are plotted in Fig. 5. ROCs are a useful technique for examining the efficacy of a prediction algorithm in inferring truly "interacting" and "noninteracting" pairs of proteins. The curve plots Sensitivity against (1 − Specificity). To draw the ROCs for the evolutionary methods, e.g. FANS, MOPSO, and CLSBA, the solution vector of Fig. 1 is considered to encode only the $N \times (N - 1)/2$ possible interaction weights of the network, excluding the threshold Th. In the present context, Th is considered to be a user input (a constant value) that is used to identify the noninteracting protein pairs, i.e. those having interaction weights below Th. The ROCs for all contender methods are drawn for varying thresholds (ranging from 0.3 to 0.8). The relative positions of the ROC curves indicate the relative efficiency of the algorithms in predicting truly interacting protein pairs. A plot corresponding to algorithm X that lies above and to the left of the plot for another algorithm Y indicates greater efficiency of X over Y for PPI prediction. On this basis, it is evident from Fig. 5, which plots the Sensitivity and (1 − Specificity) values of Table 6, that FANS exhibits the highest efficiency.
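The threshold sweep behind such a ROC plot can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: a pair is called interacting when its weight is at least Th, the function names are hypothetical, and closing the curve with the points (0, 0) and (1, 1) before applying the trapezoidal rule is an assumption of this sketch.

```python
def roc_points(weights, labels, thresholds):
    """Return sorted (1 - specificity, sensitivity) points, one per threshold Th."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = []
    for th in thresholds:
        tp = sum(1 for w, y in zip(weights, labels) if y and w >= th)
        fp = sum(1 for w, y in zip(weights, labels) if not y and w >= th)
        points.append((fp / negatives, tp / positives))
    return sorted(points)


def auc_trapezoid(points):
    """Area under the ROC points, closed with (0, 0) and (1, 1) (an assumption)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

Here `weights` holds the decoded interaction weights of the candidate protein pairs and `labels` marks which pairs are interacting in the reference database; both input names are illustrative.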
A quantitative measure of the ROC-induced efficiency of a PPI prediction algorithm is given by its Area Under Curve (AUC), as reported in Table 5. It is apparent from Fig. 5 and Table 5 that the AUCs for FANS, MOPSO, and CLSBA employing evolutionary/swarm

Table 3. Comparison of individual objectives and their combinations for the proposed PPI prediction algorithm over 25 runs for the BioGrid database (mean, with standard deviation in parentheses).

| Metric | FANS | FANS_Fun_Dom | FANS_Fun_Neigh | FANS_Dom_Neigh | FA_Fun | FA_Dom | FA_Neigh | Stat. Sig. |
|---|---|---|---|---|---|---|---|---|
| Sensitivity | 0.9411 (0.082) | 0.9153 (0.092) | 0.9083 (0.110) | 0.8932 (0.190) | 0.8897 (0.194) | 0.8806 (0.216) | 0.8760 (0.246) | + |
| Specificity | 0.9556 (0.132) | 0.9138 (0.132) | 0.9023 (0.264) | 0.8850 (0.308) | 0.8605 (0.319) | 0.8297 (0.322) | 0.8264 (0.346) | + |
| PLR | 19.945 (0.141) | 10.620 (0.208) | 9.3038 (0.282) | 7.7693 (0.288) | 6.3794 (0.297) | 5.1739 (0.311) | 5.0492 (0.378) | + |
| NLR | 0.0943 (0.106) | 0.0926 (0.130) | 0.1015 (0.177) | 0.1206 (0.275) | 0.1281 (0.287) | 0.1438 (0.299) | 0.1499 (0.320) | + |
| Precision | 0.9723 (0.006) | 0.9261 (0.294) | 0.9166 (0.333) | 0.9132 (0.367) | 0.8268 (0.387) | 0.8255 (0.447) | 0.8221 (0.466) | + |
| NPV | 0.8898 (0.109) | 0.8823 (0.112) | 0.8652 (0.130) | 0.8589 (0.146) | 0.8557 (0.194) | 0.8413 (0.211) | 0.8071 (0.258) | + |
| Accuracy | 0.9034 (0.060) | 0.9021 (0.074) | 0.8894 (0.082) | 0.8867 (0.227) | 0.8537 (0.285) | 0.8448 (0.314) | 0.8029 (0.343) | + |
| F1 score | 0.9432 (0.062) | 0.9118 (0.074) | 0.9054 (0.097) | 0.8980 (0.103) | 0.8594 (0.115) | 0.8366 (0.116) | 0.8207 (0.120) | + |
| MCC | 0.9487 (0.005) | 0.8962 (0.061) | 0.8662 (0.132) | 0.8662 (0.206) | 0.8163 (0.230) | 0.8076 (0.247) | 0.8036 (0.248) | + |

Table 4. Comparison of individual objectives and their combinations for the proposed PPI prediction algorithm over 25 runs for the DIP database (mean, with standard deviation in parentheses).

| Metric | FANS | FANS_Fun_Dom | FANS_Fun_Neigh | FANS_Dom_Neigh | FA_Fun | FA_Dom | FA_Neigh | Stat. Sig. |
|---|---|---|---|---|---|---|---|---|
| Sensitivity | 0.9213 (0.110) | 0.9042 (0.159) | 0.9014 (0.168) | 0.8727 (0.204) | 0.8625 (0.242) | 0.8524 (0.300) | 0.8227 (0.365) | + |
| Specificity | 0.9555 (0.156) | 0.9426 (0.191) | 0.8117 (0.218) | 0.7597 (0.258) | 0.7500 (0.347) | 0.7440 (0.359) | 0.7350 (0.407) | + |
| PLR | 20.704 (0.146) | 15.771 (0.164) | 4.7897 (0.170) | 3.6328 (0.191) | 3.4505 (0.217) | 3.3300 (0.227) | 3.1052 (0.295) | + |
| NLR | 0.0823 (0.136) | 0.1015 (0.139) | 0.1214 (0.162) | 0.1675 (0.165) | 0.1832 (0.192) | 0.1983 (0.224) | 0.2411 (0.267) | + |
| Precision | 0.9356 (0.006) | 0.9226 (0.031) | 0.8079 (0.039) | 0.7939 (0.166) | 0.7872 (0.167) | 0.7860 (0.185) | 0.7517 (0.278) | − |
| NPV | 0.9007 (0.096) | 0.8651 (0.113) | 0.8622 (0.115) | 0.8149 (0.210) | 0.7931 (0.231) | 0.7707 (0.259) | 0.7417 (0.263) | + |
| Accuracy | 0.9034 (0.116) | 0.8912 (0.120) | 0.8631 (0.136) | 0.8372 (0.162) | 0.8234 (0.240) | 0.8050 (0.259) | 0.7980 (0.432) | + |
| F1 score | 0.9328 (0.073) | 0.9327 (0.162) | 0.9308 (0.236) | 0.8837 (0.260) | 0.8558 (0.280) | 0.8433 (0.292) | 0.8349 (0.299) | − |
| MCC | 0.9446 (0.003) | 0.9072 (0.033) | 0.8960 (0.169) | 0.8462 (0.198) | 0.8129 (0.276) | 0.7756 (0.302) | 0.7596 (0.343) | + |

Fig. 5. ROC plot for different PPI prediction algorithms for BioGrid database.

Table 5. AUC obtained from Fig. 5 (mean AUC, with standard deviation in parentheses).

| FANS | MOPSO | CLSBA | REP | RSS | DCC | RDF |
|---|---|---|---|---|---|---|
| 0.942 (0.09) | 0.910 (0.17) | 0.898 (0.19) | 0.873 (0.19) | 0.854 (0.27) | 0.784 (0.39) | 0.602 (0.92) |

optimization methods have attained higher values than the other competing classification methods. Plots of Precision versus Recall (PROC curves) are given in Figs. 6(a) and 6(b). Figure 6(a) plots the values of Precision and Recall for the various PPI prediction algorithms for interactions obtained from the BioGrid database, whereas Fig. 6(b) plots the same for the DIP database. In the task of predicting PPIs, we put more emphasis on Precision, since low reliability is one of the main weaknesses of the experimental methods. At the same time, we aim to obtain a reasonable value of Sensitivity (Recall). Our experience of working with FANS, substantiated by the plots in Figs. 6(a) and 6(b), indicates that the proposed PPI prediction algorithm in general offers a good level of Precision and Recall. To assess the relative merits of the algorithms, a straight line is drawn at an angle of 45° to the Recall axis such that it passes through the curves of all contender algorithms. In our analysis, we have taken the distance from the origin to the point where this straight line intersects each PROC curve as a measure of the performance of the respective algorithm. The higher the measure, the better the performance. The symbol "≻" is used to represent the relative performance of any two competing algorithms. Using this convention, the ranking of the algorithms can be written as FANS ≻ MOPSO ≻ CLSBA ≻ REP ≻ RSS ≻ DCC ≻ RDF. Tables 6 and 7 report the mean and standard deviation of the best-of-run values of the performance metrics for 25 independent runs of each PPI prediction

Fig. 6. (a) PROC plot for different PPI prediction algorithms for BioGrid database. (b) PROC plot for different PPI prediction algorithms for DIP database.

algorithm (considered in the comparative framework) for the BioGrid and DIP databases, respectively. The last rows of Tables 6 and 7 denote the statistical significance (determined by a paired two-tailed t-test) of the differences between the metric values achieved by the best and the second best algorithms. The symbols carry the same meaning as in Tables 3 and 4. From Table 6, it is interesting to observe that in seven of the nine performance metrics, FANS outperforms its nearest competitor in a statistically significant manner. Here the MOPSO-based PPI prediction method, which outperforms FANS in the case of NPV, attains the second best rank. In the case of Accuracy, FANS obtains a mean value equal to that of MOPSO. However, here too FANS provides the smaller standard deviation, demonstrating its consistency in performance. A similar observation holds for Table 7. It is evident that FANS outperforms the other contenders in a statistically significant fashion, except for Specificity. Though FANS achieves the maximum value of Specificity, the difference between

Table 6. Comparison of different PPI prediction algorithms over 25 runs for the BioGrid database (mean, with standard deviation in parentheses).

| Metric | FANS | MOPSO | CLSBA | REP | RSS | DCC | RDF | Stat. Sig. |
|---|---|---|---|---|---|---|---|---|
| Sensitivity | 0.9411 (0.082) | 0.9065 (0.141) | 0.8934 (0.164) | 0.8612 (0.201) | 0.8414 (0.275) | 0.7637 (0.466) | 0.6410 (0.629) | + |
| Specificity | 0.9556 (0.132) | 0.9499 (0.166) | 0.9462 (0.194) | 0.8982 (0.285) | 0.7851 (0.349) | 0.6822 (0.599) | 0.5815 (0.617) | + |
| PLR | 19.945 (0.141) | 18.976 (0.173) | 8.3173 (0.183) | 7.1431 (0.196) | 3.3521 (0.437) | 1.9834 (0.512) | 1.5317 (0.651) | + |
| NLR | 0.0943 (0.106) | 0.1143 (0.163) | 0.14313 (0.172) | 0.1941 (0.198) | 0.2835 (0.396) | 0.4432 (0.676) | 0.6174 (0.792) | + |
| Precision | 0.9723 (0.006) | 0.9087 (0.009) | 0.8748 (0.011) | 0.8672 (0.042) | 0.7312 (0.502) | 0.6679 (0.588) | 0.5804 (0.734) | + |
| NPV | 0.8898 (0.109) | 0.9012 (0.093) | 0.8534 (0.142) | 0.7987 (0.192) | 0.7425 (0.286) | 0.6322 (0.324) | 0.5160 (0.509) | − |
| Accuracy | 0.9034 (0.060) | 0.9034 (0.143) | 0.8654 (0.163) | 0.8734 (0.132) | 0.7828 (0.521) | 0.7854 (0.564) | 0.6313 (0.694) | NA |
| F1 score | 0.9432 (0.062) | 0.8953 (0.087) | 0.8593 (0.093) | 0.8813 (0.097) | 0.7635 (0.347) | 0.7671 (0.537) | 0.6092 (0.896) | + |
| MCC | 0.9487 (0.005) | 0.8999 (0.003) | 0.8986 (0.003) | 0.8921 (0.007) | 0.7854 (0.457) | 0.6832 (0.547) | 0.6151 (0.561) | + |

Table 7. Comparison of different PPI prediction algorithms over 25 runs for the DIP database (mean, with standard deviation in parentheses).

| Metric | FANS | MOPSO | CLSBA | REP | RSS | DCC | RDF | Stat. Sig. |
|---|---|---|---|---|---|---|---|---|
| Sensitivity | 0.9213 (0.110) | 0.8949 (0.154) | 0.8684 (0.178) | 0.8529 (0.219) | 0.7862 (0.392) | 0.6735 (0.600) | 0.5336 (0.621) | + |
| Specificity | 0.9555 (0.156) | 0.9474 (0.182) | 0.9033 (0.276) | 0.8845 (0.337) | 0.7090 (0.497) | 0.5837 (0.606) | 0.4746 (0.612) | − |
| PLR | 20.704 (0.146) | 17.038 (0.177) | 8.9879 (0.190) | 7.3904 (0.363) | 2.7022 (0.484) | 1.6182 (0.521) | 1.0159 (1.265) | + |
| NLR | 0.0823 (0.136) | 0.1108 (0.166) | 0.1455 (0.190) | 0.1662 (0.303) | 0.3014 (0.425) | 0.5592 (0.766) | 0.9823 (0.538) | + |
| Precision | 0.9356 (0.006) | 0.8838 (0.009) | 0.8693 (0.024) | 0.8028 (0.252) | 0.7233 (0.546) | 0.6629 (0.681) | 0.5142 (0.650) | + |
| NPV | 0.9007 (0.096) | 0.8857 (0.107) | 0.8354 (0.176) | 0.7463 (0.215) | 0.6569 (0.311) | 0.6141 (0.387) | 0.4398 (0.406) | + |
| Accuracy | 0.9034 (0.116) | 0.8656 (0.155) | 0.8684 (0.160) | 0.7829 (0.311) | 0.7839 (0.540) | 0.7499 (0.605) | 0.6029 (0.607) | + |
| F1 score | 0.9328 (0.073) | 0.8605 (0.088) | 0.8751 (0.094) | 0.7814 (0.182) | 0.7656 (0.383) | 0.7257 (0.624) | 0.5645 (0.704) | + |
| MCC | 0.9446 (0.003) | 0.8995 (0.003) | 0.8933 (0.003) | 0.8463 (0.314) | 0.7390 (0.495) | 0.6589 (0.556) | 0.5877 (0.551) | + |

Fig. 7. Original subnetwork of PPI network in yeast.

the values obtained by FANS and MOPSO for the same metric is not statistically significant at a significance level of $\alpha = 0.05$.

In order to justify the philosophy of maximizing the individual objectives for predicting PPIs, we have considered a subnetwork of the PPI dataset of SC (Fig. 7) comprising 11 proteins, namely CTR9, LEO1, RTF1, PAF1, POB3, RPO21, CKB1, CKB2, HTZ1, SET2, and CTK1. The values of the three objectives are tabulated in Table 8 for one interacting protein pair, POB3–SET2, and for one noninteracting protein pair, POB3–CLN1, selected from the network in Fig. 7. For the interaction formed by proteins $i$ and $j$, the objectives are defined as follows.

1. $s_f(p_i, p_j)$ measures the functional similarity between proteins $i$ and $j$, calculated using (4).
2. $s_d(p_i, p_j)$ measures the similarity in the domain interaction profiles of proteins $i$ and $j$, computed using (5).
3. $|n_{i,j}|/N$ measures the common neighborhood ratio of proteins $i$ and $j$, where their common neighborhood $n_{i,j}$ is identified following (16.2).

In Table 8A, the functions of proteins POB3, SET2, and CLN1 are shown to be annotated by 20, 24, and 6 GO terms, respectively. The interacting protein pair POB3–SET2 has four common GO terms, whereas the noninteracting protein pair POB3–CLN1 has only one common GO term. Following the principle of Sec. 2.2 and using expression (4), we see that the value of $s_f(p_i, p_j)$ is higher for the interacting protein pair than for the noninteracting pair, indicating that functional similarity is a good indicator of interaction. In Table 8B, protein POB3 has two unique domains (PF03531, PF08512), which interact with 10 domains. Protein SET2 has one unique domain (PF00856), which interacts with 30 domains, and protein CLN1 has one unique domain (PF00134), which interacts with 49 domains. It is evident that the interacting protein pair (POB3–SET2) has four common domains in its interacting domain lists, whereas

Table 8. Case study on fitness function values for interacting and noninteracting protein pairs.

A. Observation on functional similarity

GO annotation of biological processes of interacting proteins

| Protein | GO terms |
|---|---|
| POB3 | GO:0006261, GO:0034401, GO:0006974, GO:0034724, GO:0045899, GO:0006281, GO:0006351, GO:0006260, GO:0006355, GO:0006974, GO:0005634, GO:0005694, GO:0000790, GO:0005658, GO:0035101, GO:0031298, GO:0031491, GO:0003682, GO:0042393, GO:0003677 |
| SET2 | GO:0030437, GO:0016571, GO:0006355, GO:0016575, GO:0006354, GO:2000616, GO:0071441, GO:0010452, GO:0045128, GO:0030174, GO:0035066, GO:0060195, GO:0006351, GO:0032259, GO:0034968, GO:0005634, GO:0005694, GO:0005829, GO:0042054, GO:0018024, GO:0008168, GO:0046975, GO:0016740, GO:0042054 |

$s_f(\mathrm{POB3}, \mathrm{SET2}) = 0.769$

GO annotation of biological processes of noninteracting proteins

| Protein | GO terms |
|---|---|
| POB3 | GO:0006261, GO:0034401, GO:0006974, GO:0034724, GO:0045899, GO:0006281, GO:0006351, GO:0006260, GO:0006355, GO:0006974, GO:0005634, GO:0005694, GO:0000790, GO:0005658, GO:0035101, GO:0031298, GO:0031491, GO:0003682, GO:0042393, GO:0003677 |
| CLN1 | GO:0000079, GO:0051301, GO:0007049, GO:0005634, GO:0005737, GO:0016538 |

$s_f(\mathrm{POB3}, \mathrm{CLN1}) = 0.224$

B. Observation on domain interaction

Domain–domain interaction of interacting proteins

| Protein | Unique domains | Interacting domains |
|---|---|---|
| POB3 | PF03531, PF08512 | PF00069, PF00125, PF00505, PF03513, PF00125, PF00467, PF00349, PF01992, PF03727, PF04438, PF00651 |
| SET2 | PF00856 | PF00009, PF00023, PF00036, PF00076, PF00096, PF00096, PF00104, PF00105, PF00249, PF00385, PF00400, PF00439, PF00467, PF00505, PF00557, PF00628, PF00651, PF00855, PF00856, PF01426, PF01429, PF01753, PF01992, PF02008, PF02146, PF02178, PF02493, PF05033, PF07645, PF09273 |

$s_d(\mathrm{POB3}, \mathrm{SET2}) = 0.5368$

Domain–domain interaction of noninteracting proteins

| Protein | Unique domains | Interacting domains |
|---|---|---|
| POB3 | PF03531, PF08512 | PF00069, PF00125, PF00505, PF03513, PF00125, PF00467, PF00349, PF01992, PF03727, PF04438, PF00651 |
| CLN1 | PF00134 | PF00009, PF00011, PF00012, PF00023, PF00046, PF00069, PF00076, PF00104, PF00134, PF00172, PF00231, PF00240, PF00254, PF00270, PF00324, …, PF00400, PF00433, PF00435, PF02319, PF02630, PF02984, PF03143, PF03144, PF03234, PF03775, PF04153, PF06391, PF07392, PF08934, PF09080, PF09241 |

$s_d(\mathrm{POB3}, \mathrm{CLN1}) = 0.0012$

C. Similarity in neighboring proteins

| Protein | Set of proteins directly interacting |
|---|---|
| POB3 | CTR9, LEO1, RTF1, CKB1, CKB2, SET2 |
| SET2 | CTR9, LEO1, RTF1, PAF1, POB3, RPO21, CKB1, HTZ1, CTK1 |

$|n_{\mathrm{POB3},\mathrm{SET2}}|/N = 4/11 = 0.36364$

| Protein | Set of proteins directly interacting |
|---|---|
| POB3 | CTR9, LEO1, RTF1, CKB1, CKB2, SET2 |
| CLN1 | SWI4, CKS1, MBP1, CLB1, FUS3, CDC6 |

$|n_{\mathrm{POB3},\mathrm{CLN1}}|/N = 0/11 = 0$

there exists only one common domain in the interacting domain lists of the noninteracting protein pair (POB3–CLN1). Then $s_d(p_i, p_j)$ is evaluated for both the interacting and the noninteracting protein pairs using (5). It can be observed from Table 8B that the value of $s_d(p_i, p_j)$ for the interacting pair is higher than that for the noninteracting pair, revealing a higher degree of similarity between the domain interaction profiles of the interacting protein pair.

In Table 8C, we present the sets of proteins directly interacting with proteins POB3, SET2, and CLN1 in the subnetwork (of total $N = 11$ proteins). The interacting protein pair POB3–SET2 has four common neighbors (CTR9, LEO1, RTF1, and CKB1), whereas the noninteracting protein pair has none. Hence the value of $|n_{i,j}|/N$ in Table 8C is 0.36364 for the interacting pair and zero for the noninteracting pair. These values indicate that neighborhood topology information can be used for predicting PPIs.
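The common neighborhood ratio in Table 8C can be checked with a few lines of Python. This is a minimal sketch, assuming that the common neighborhood $n_{i,j}$ is simply the intersection of the two direct-neighbor sets; the paper's own definition is the one it cites as (16.2) and may differ in detail. The dictionary `NEIGHBORS` and the function name `common_neighbor_ratio` are illustrative choices, while the neighbor lists and $N = 11$ are copied from Table 8C.

```python
# Direct interaction partners copied from Table 8C (subnetwork of N = 11 proteins).
NEIGHBORS = {
    "POB3": {"CTR9", "LEO1", "RTF1", "CKB1", "CKB2", "SET2"},
    "SET2": {"CTR9", "LEO1", "RTF1", "PAF1", "POB3", "RPO21", "CKB1", "HTZ1", "CTK1"},
    "CLN1": {"SWI4", "CKS1", "MBP1", "CLB1", "FUS3", "CDC6"},
}
N = 11  # total number of proteins in the subnetwork


def common_neighbor_ratio(a, b, neighbors=NEIGHBORS, n=N):
    """Return (shared partners, |n_{a,b}| / n).

    Assumption: the common neighborhood is the plain set intersection
    of the direct partners of proteins a and b.
    """
    shared = neighbors[a] & neighbors[b]
    return shared, len(shared) / n


shared_int, ratio_int = common_neighbor_ratio("POB3", "SET2")
shared_non, ratio_non = common_neighbor_ratio("POB3", "CLN1")

print(sorted(shared_int), round(ratio_int, 5))  # ['CKB1', 'CTR9', 'LEO1', 'RTF1'] 0.36364
print(sorted(shared_non), ratio_non)            # [] 0.0
```

Under this assumption the sketch reproduces both entries of Table 8C: 4/11 for POB3–SET2 and 0/11 for POB3–CLN1.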

Fig. 8. Subnetwork obtained by PPI prediction algorithms: (a) FANS, (b) MOPSO, (c) CLSBA, (d) REP Tree, (e) RSS, (f) DCC, and (g) RDF.

Table 8 thus supports the selection of the three characteristic features for predicting protein interactions. The PPIs predicted for the same subnetwork by the seven competing algorithms are shown in Fig. 8. Comparing Fig. 7 with Fig. 8, it is apparent that the FANS-based method predicts the correct PPIs more accurately than its competitors.
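One way to make the visual comparison of Fig. 7 with Fig. 8 quantitative is to score each predicted subnetwork against the reference edges. The sketch below assumes the standard confusion-matrix definitions of sensitivity, specificity, and F1 score over all unordered protein pairs. The four-protein network, both edge lists, and the function name `edge_metrics` are hypothetical and are not taken from Fig. 7 or Fig. 8.

```python
from itertools import combinations


def edge_metrics(proteins, true_edges, predicted_edges):
    """Score predicted undirected edges against reference edges.

    Every unordered protein pair is treated as one binary decision
    (interacting or not), which is an assumption of this sketch.
    """
    true_set = {frozenset(e) for e in true_edges}
    pred_set = {frozenset(e) for e in predicted_edges}
    all_pairs = {frozenset(p) for p in combinations(proteins, 2)}

    tp = len(true_set & pred_set)                 # correctly predicted interactions
    fp = len(pred_set - true_set)                 # predicted but not in the reference
    fn = len(true_set - pred_set)                 # missed interactions
    tn = len(all_pairs - true_set - pred_set)     # correctly rejected pairs

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


# Hypothetical toy network, used only to exercise the function.
toy_proteins = ["A", "B", "C", "D"]
toy_true = [("A", "B"), ("B", "C"), ("C", "D")]
toy_pred = [("A", "B"), ("B", "C"), ("A", "D")]

scores = edge_metrics(toy_proteins, toy_true, toy_pred)
print(scores)
```

Storing each edge as a `frozenset` makes the comparison independent of the order in which the two proteins of a pair are listed, which matches the undirected network assumed in the paper.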

5. Conclusion

In this paper, the PPI prediction problem is formulated as an MOO problem. The proposed FANS algorithm is used to predict the PPI network. Here we have analyzed the effect of three essential characteristic features of proteins for competently predicting the PPI network, including (i) functional similarity, (ii) domain interaction profile, and (iii) commonality in neighboring proteins of interacting protein pairs. The proposed method performs well on unbalanced data. Experiments undertaken reveal the superiority of the proposed method over its state-of-the-art contenders in predicting PPIs in a statistically significant manner.


Appendix A.

The interaction weights between protein pairs in a PPI network with $N$ proteins can be represented by a two-dimensional matrix $X$ of dimension $N \times N$ as follows. Here $w_{i,j}$ signifies the weight of interaction between proteins $p_i$ and $p_j$ for $i, j = 1, 2, \ldots, N$ with $i \neq j$. Self-interaction is neglected by setting $w_{i,i} = 0$ for $i = 1, 2, \ldots, N$.

\[
X =
\begin{array}{c|ccccc}
 & p_1 & p_2 & p_3 & \cdots & p_N \\ \hline
p_1 & 0 & w_{1,2} & w_{1,3} & \cdots & w_{1,N} \\
p_2 & w_{2,1} & 0 & w_{2,3} & \cdots & w_{2,N} \\
p_3 & w_{3,1} & w_{3,2} & 0 & \cdots & w_{3,N} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
p_N & w_{N,1} & w_{N,2} & w_{N,3} & \cdots & 0
\end{array}
\tag{A.1}
\]

For a symmetric undirected network, $w_{i,j} = w_{j,i}$ for $i, j = 1, 2, \ldots, N$. Hence, the same information about the connectivity between protein pairs in the PPI network can be revealed by transforming $X$ into an upper triangular matrix, given as follows.

\[
X =
\begin{array}{c|ccccc}
 & p_1 & p_2 & p_3 & \cdots & p_N \\ \hline
p_1 & 0 & w_{1,2} & w_{1,3} & \cdots & w_{1,N} \\
p_2 & 0 & 0 & w_{2,3} & \cdots & w_{2,N} \\
p_3 & 0 & 0 & 0 & \cdots & w_{3,N} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
p_N & 0 & 0 & 0 & \cdots & 0
\end{array}
\tag{A.2}
\]

It is evident from (A.2) that the number of nonzero elements in $X$ is $N \times (N-1)/2$. To reduce the storage space, the nonzero elements of $X$ can now be represented as elements of a $1 \times (N \times (N-1)/2)$ vector $Z$ as given below.

\[
\begin{array}{c|ccccccccccc}
\text{index} & 1 & 2 & 3 & \cdots & & & & \cdots & & \cdots & \tfrac{N(N-1)}{2} \\ \hline
Z & w_{1,2} & w_{1,3} & w_{1,4} & \cdots & w_{1,N} & w_{2,3} & w_{2,4} & \cdots & w_{2,N} & \cdots & w_{N-1,N}
\end{array}
\]


The index $k$ (say) of the weight $w_{i,j}$ in $Z$ is obtained as follows.

\[
\begin{aligned}
k &= \text{Number of elements up to } w_{i,j} \text{ in matrix } X \\
  &= \text{Total number of elements in the first } (i-1) \text{ rows} \\
  &\quad + \text{Number of elements up to the } j\text{th column in the } i\text{th row} \\
  &= (N-1) + (N-2) + \cdots + (N-(i-1)) + (j-i) \\
  &= N \times (i-1) - (1 + 2 + \cdots + (i-1) + i) + j \\
  &= N \times (i-1) - \frac{i \times (i+1)}{2} + j.
\end{aligned}
\tag{A.3}
\]
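The index formula (A.3) can be exercised directly in code. The sketch below is one possible implementation, assuming 1-based protein indices $i < j$ as in the appendix and a row-by-row packing of the elements above the diagonal; the function names `pair_to_index` and `flatten_upper` are illustrative and do not come from the paper.

```python
def pair_to_index(i, j, n):
    """1-based position k of w_{i,j} (with i < j) in the vector Z, following (A.3)."""
    if not (1 <= i < j <= n):
        raise ValueError("expected 1 <= i < j <= n")
    # i * (i + 1) is always even, so the integer division is exact.
    return n * (i - 1) - i * (i + 1) // 2 + j


def flatten_upper(x):
    """Pack the above-diagonal elements of a square matrix x row by row into Z."""
    n = len(x)
    return [x[r][c] for r in range(n) for c in range(r + 1, n)]


# Small check with N = 5: label each weight w_{i,j} by the number 10*i + j.
N = 5
X = [[10 * (r + 1) + (c + 1) if c > r else 0 for c in range(N)] for r in range(N)]
Z = flatten_upper(X)

indices = [pair_to_index(i, j, N) for i in range(1, N) for j in range(i + 1, N + 1)]
print(indices)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(len(Z))   # 10, i.e. N * (N - 1) / 2
```

For $N = 5$, the pair $(2, 3)$ maps to $k = 5 \times 1 - 3 + 3 = 5$, the first position after the four weights of row 1, and the last pair $(4, 5)$ maps to $k = 10 = N(N-1)/2$.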

Acknowledgment

Funding by the Council of Scientific and Industrial Research (CSIR) (for awarding a Senior Research Fellowship to the second author) and UGC (for the UPE-II program) is gratefully acknowledged for the present work.


Archana Chowdhury received the degree of B.E. in Computer Science and Engineering from Maharana Pratap College of Technology, Gwalior, India and M.E. in Computer Engineering from Jadavpur University, Kolkata in 2008. She is currently working towards her Ph.D. degree in Handling Protein–Protein Interaction and Protein Ligand Docking Problem by a Machine Intelligence Approach, under the guidance of Prof. Amit Konar from Jadavpur University, Kolkata. Her research interests include artificial intelligence, fuzzy logic, evolutionary computation, pattern recognition, bioinformatics, and data mining. She has published over 12 papers in various conference proceedings and journals. She has served as a reviewer for many conference and journal papers.

Pratyusha Rakshit received the B.Tech. degree in Electronics and Communication Engineering (ECE) from Institute of Engineering and Management, India, and M.E. degree in Control Engineering from Electronics and Telecommunication Engineering (ETCE) Department, Jadavpur University, India in 2010 and 2012, respectively. She has submitted her Ph.D. (Engineering) thesis on co-ordination in multi-agent robotics in the presence of measurement noise under the guidance of Prof. Amit Konar at Jadavpur University, India. From August 2015 to November 2015, she was an Assistant Professor in ETCE Department, Indian Institute of Engineering Science and Technology, India. She is currently an Assistant Professor in ETCE Department, Jadavpur University. Her principal research interests include artificial and computational intelligence, evolutionary computation, robotics, bioinformatics, pattern recognition, fuzzy logic, cognitive science, and human–computer interaction. She is an author of over 47 papers published in top international journals and conference proceedings. She serves as a reviewer in IEEE-ReTIS, IEEE-TFS, IEEE-SMC: Systems, Neurocomputing, Applied Soft Computing, IJAISC, IJSI, and FUZZ-IEEE.


Amit Konar received the B.E. degree from Bengal Engineering and Science University (B.E. College) Howrah, India, in 1983 and the M.E. Tel E, M.Phil., and Ph.D. (Engineering) degrees from Jadavpur University, Calcutta-700032, India, in 1985, 1988, and 1994, respectively. In 2006, he was a Visiting Professor with the University of Missouri, St. Louis. He is currently a Professor with the Department of Electronics and Tele-Communication Engineering (ETCE), Jadavpur University, where he is the Founding Coordinator of the M.Tech. program on intelligent automation and robotics. He has supervised 15 Ph.D. theses. He has over 200 publications in international journals and conference proceedings. He is the author of eight books, including two popular texts, Artificial Intelligence and Soft Computing (CRC Press, 2000) and Computational Intelligence: Principles, Techniques and Applications (Springer, 2005). He serves as the Associate Editor of IEEE Transactions on Systems, Man and Cybernetics, Part A and IEEE Transactions on Fuzzy Systems. His research areas include the study of computational intelligence algorithms and their applications to the various domains of electrical engineering and computer science. Specifically, he worked on fuzzy sets and logic, neurocomputing, evolutionary algorithms, Dempster–Shafer theory, and Kalman filtering, and applied the principles of computational intelligence in image understanding, VLSI design, mobile robotics, pattern recognition, brain–computer interfacing, and computational biology. He was the recipient of the All India Council for Technical Education (AICTE)-accredited 1997–2000 Career Award for Young Teachers and Fellow of National Academy of Engineers 2015 for his significant contribution in teaching and research.
