Predicting protein complexes from weighted protein-protein interaction graphs with a novel unsupervised methodology: Evolutionary enhanced Markov clustering.

Accepted Manuscript

Title: Predicting protein complexes from weighted protein-protein interaction graphs with a novel unsupervised methodology: evolutionary enhanced Markov clustering

Author: Konstantinos Theofilatos, Niki Pavlopoulou, Christoforos Papasavvas, Spiros Likothanassis, Christos Dimitrakopoulos, Efstratios Georgopoulos, Charalampos Moschopoulos, Seferina Mavroudi

PII: S0933-3657(14)00150-X
DOI: http://dx.doi.org/doi:10.1016/j.artmed.2014.12.012
Reference: ARTMED 1385

To appear in: ARTMED

Received date: 15-2-2013
Revised date: 23-12-2014
Accepted date: 26-12-2014

Please cite this article as: Theofilatos K, Pavlopoulou N, Papasavvas C, Likothanassis S, Dimitrakopoulos C, Georgopoulos E, Moschopoulos C, Mavroudi S, Predicting protein complexes from weighted protein-protein interaction graphs with a novel unsupervised methodology: evolutionary enhanced Markov clustering, Artificial Intelligence in Medicine (2015), http://dx.doi.org/10.1016/j.artmed.2014.12.012

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Predicting protein complexes from weighted protein-protein interaction graphs with a novel unsupervised methodology: evolutionary enhanced Markov clustering

Konstantinos Theofilatos a,*, Niki Pavlopoulou a, Christoforos Papasavvas a, Spiros Likothanassis a, Christos Dimitrakopoulos a, Efstratios Georgopoulos b, Charalampos Moschopoulos c,d, Seferina Mavroudi a,e,*

a Department of Computer Engineering and Informatics, University of Patras, Building B University Campus Rio, Zip Code: 26500, Patras, Greece, telephone number: +302610996985.

b Department of Agricultural Technology, Technological Educational Institute of Kalamata, Antikalamos, Zip Code: 24100, Kalamata, Greece, telephone number: +3027210 45100.

c Department of Electrical Engineering, Katholieke Universiteit, Kasteelpark Arenberg 10 box 2440, Zip Code: 3001, Leuven, Belgium, telephone number: +32 16 3 21130.

d iMinds Future Health Department, Katholieke Universiteit, Oude Markt 13 - bus 5005, Zip Code: 3000, Leuven, Belgium, telephone number: +3216324010.

e Department of Social Work, School of Sciences of Health and Care, Technological Educational Institute of Patras, M. Alexandrou str. 1, Zip Code: 263 34, Patras, Greece, telephone number: +302610369349.

*Corresponding Authors:

Theofilatos Konstantinos, email: [email protected], telephone number: +306973641685, address: Building B, University Campus Rio, Patras, Greece

Mavroudi Seferina, email: [email protected], telephone number: +302610997520, address: Building B, University Campus Rio, Patras, Greece

• EE-MC is a new unsupervised methodology for predicting protein complexes from weighted PPI graphs

• It is by design able to overcome intrinsic limitations of existing methodologies

• It outperformed existing methodologies increasing the separation metric by 10-20%

• 72.58% of the predicted protein complexes in human are enriched for at least one GO function term

Abstract.


Objective. Proteins are considered to be the most important individual components of biological systems, and they combine to form physical protein complexes which are responsible for certain molecular functions. Despite the large availability of protein-protein interaction (PPI) information, not much information is available about protein complexes. Experimental methods are limited in terms of time, efficiency, cost and performance constraints. Existing computational methods have provided encouraging preliminary results, but they face certain disadvantages: they require parameter tuning, some of them cannot handle weighted PPI data and others do not allow a protein to participate in more than one protein complex. In the present paper, we propose a new fully unsupervised methodology for predicting protein complexes from weighted PPI graphs.


Methods and materials. The proposed methodology is called evolutionary enhanced Markov clustering (EE-MC) and it is a hybrid combination of an adaptive evolutionary algorithm and a state-of-the-art clustering algorithm named enhanced Markov clustering. EE-MC was compared with state-of-the-art methodologies when applied to datasets from the human and the yeast Saccharomyces cerevisiae organisms.


Results. Using publicly available datasets, EE-MC outperformed existing methodologies (in some datasets the separation metric was increased by 10-20%). Moreover, when applied to new human datasets its performance was encouraging in the prediction of protein complexes which consist of proteins with high functional similarity. Specifically, 5737 protein complexes were predicted and 72.58% of them are enriched for at least one gene ontology (GO) function term.


Conclusions. EE-MC is by design able to overcome intrinsic limitations of existing methodologies such as their inability to handle weighted PPI networks, their constraint to assign every protein to exactly one cluster and the difficulties they face concerning parameter tuning. This was experimentally validated and, moreover, new potentially true human protein complexes were suggested as candidates for further validation using experimental techniques.

Keywords: Evolutionary algorithms, evolutionary enhanced Markov clustering, genetic algorithms, large scale biological networks analysis, weighted protein-protein interaction networks, protein complex prediction, functional characterization of proteins and protein complexes

1 Introduction

Most functions within a living cell contribute to a coordinated sequence of a large number of molecular interaction events. Proteins are considered to be the most important of the molecules participating in such interaction events. They transmit regulatory signals throughout the cell, catalyze a huge number of chemical reactions, and are critical for the stability of numerous cellular structures. Proteins do not function only as single units, but they interact at small and large scales of the cellular interconnecting components. At smaller scales, proteins bind to form protein-protein interactions (PPIs), whereas at larger scales they bind in groups to form functional protein complexes. The prediction of these complexes is crucial for understanding the cellular mechanisms and for predicting the function of uncharacterized proteins. However, their experimental prediction is limited to a few methods, such as tandem affinity purification (TAP) [1], which provide partially erroneous data and are both costly and time-consuming.

During the last decade, the research community has acquired an enormous amount of small scale PPI data. This is attributed to the development of high throughput experimental PPI prediction methods such as the yeast two-hybrid (Y2H) system [2], mass spectrometry (MS) [3] and protein microarrays [4]. Moreover, computational methods have been developed to filter experimentally verified PPIs, to compute a confidence score for each PPI and to predict new ones [5]. Taking advantage of this growing amount of PPI data, scientists have deployed in silico analysis of PPI graphs [6]. Specifically, the known or computationally predicted PPIs of an organism are represented as an undirected graph G = (V, E), where the nodes V represent the proteins and the edges E the PPIs. These graphs are then analyzed with computational clustering methods in order to predict the large scale protein complexes. When a confidence score is available for each PPI, this score is projected onto the corresponding PPI graph as an edge weight. The use of weighted graphs has been shown to enhance the accuracy of protein complex prediction [7].


The computational methods for predicting protein complexes are split into supervised and unsupervised ones. Supervised methods such as [8, 9] use previously known protein complexes to guide the procedure of extracting protein complexes from the PPI graphs. The limited information which is nowadays available about protein complexes of high confidence is a very strong drawback of supervised techniques in predicting new protein complexes. Thus, the protein complex prediction problem is better formulated as uncovering the protein complexes within a PPI graph and not as classifying proteins to known complexes.

The most established unsupervised methods for the prediction of protein complexes are Markov clustering (MCL) [10], RNSC [11], MCODE [12] and the local clique merging algorithm (LCMA) [13]. Although they have been used extensively with encouraging initial results, they exhibit some very important drawbacks. In particular, they require a time-consuming and tricky parameter tuning procedure, some of them cannot handle weighted PPI data and others do not allow a protein to participate in more than one complex. The ability to predict overlapping protein complexes is one of the most important properties of a protein complex prediction method. In particular, proteins participate in more than one complex when they perform different cellular functions and consequently contribute to pleiotropic phenotypes when mutated [14]. Moreover, there exist many complexes whose composition and function may vary according to the context and conditions [15]. Several algorithms [16-20] have been proposed lately to overcome these drawbacks but their proposed solutions are partial and cannot handle all the existing limitations.

The MCL algorithm [10] is among the prevailing methods for clustering PPI graphs in order to predict protein complexes. Using stochastic Markov matrices, this algorithm performs random walks through the graph and computes their probabilities to find the optimal paths in a deterministic manner. The transition probabilities are controlled by a parameter which is called the inflation rate. Using the inflation operator, the MCL algorithm performs stochastic matrix transformations to create non-connected subgraphs iteratively. As the inflation parameter gets higher, the MCL algorithm predicts more clusters which are smaller in size. Despite its widespread usage, it has some strong limitations, the most important one being its restriction to assign each protein to only one protein complex. The second limitation of the MCL algorithm is that its performance depends strongly on the manual selection of an optimal inflation parameter.
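The expansion and inflation steps described above can be illustrated with a short sketch. This is a minimal pure-Python rendering of the general MCL idea on a toy weighted graph, not the implementation used in this work; the function names (`normalize_columns`, `expand`, `inflate`, `mcl`), the added self-loops, the fixed iteration count and the toy graph are all assumptions made for illustration.

```python
def normalize_columns(m):
    """Scale each column of a square matrix so that it sums to 1."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / s if s > 0 else 0.0
    return out


def expand(m):
    """Expansion: square the matrix, i.e. take two-step random walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise every entry to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(weights, inflation=2.0, iterations=50):
    """Toy MCL on a symmetric weight matrix; returns a set of frozensets."""
    n = len(weights)
    # Self-loops are added before normalization (an assumption of this sketch).
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Rows that keep probability mass act as attractors; the columns they
    # attract are read off as one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Toy graph: two triangles (weights 1.0) joined by one weak edge (weight 0.1).
toy = [[0.0] * 6 for _ in range(6)]
for a, b, wt in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                 (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    toy[a][b] = toy[b][a] = wt
toy_clusters = mcl(toy, inflation=2.0)
```

In this sketch every node ends up in exactly one cluster, which is the single-membership restriction discussed above; raising `inflation` would be expected to produce more, smaller clusters.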


To overcome the above problems of MCL, Moschopoulos et al. (2008) [21] proposed the enhanced Markov clustering (EMC) method, which is an improvement of the MCL algorithm. Specifically, it deploys the MCL algorithm to make an initial clustering and then improves it by applying four different filtering methods: density filter, haircut, best neighbor and cutting edge operators. However, the application of these methods requires tuning of their parameters. Moreover, they have not been designed to function on weighted graphs and therefore the transformation of weighted graphs to binary ones is a prerequisite. In [22] the authors proposed the GAPPI method, which applies a genetic algorithm (GA) to optimize the parameters of the MCL filtering methods. However, GAPPI uses a supervised procedure, does not take advantage of weighted PPI graphs for the filters, and optimizes the filters independently of the selection of the inflation rate.


In the present paper, we propose a fully unsupervised method, evolutionary enhanced Markov clustering (EE-MC), which is a combination of an adaptive GA and a variation of the EMC method. The filtering methods of the EMC algorithm were adjusted to enable handling of weighted PPI graphs. Moreover, an adaptive GA was used to optimize in parallel the inflation rate and the parameters of the filtering methods of EMC. For the unsupervised guidance of the evolutionary procedure, a novel unsupervised fitness function was proposed and used. Moreover, the adaptive nature of the mutation operator which was used in the proposed evolutionary method enabled the effective solution of this demanding optimization problem.


The proposed method was evaluated on three PPI graphs. Its experimental results were compared with two well established methods: MCL and RNSC. The superiority of the proposed method was demonstrated using some benchmark protein complex datasets. To the best of our knowledge, this is the first time that a computational method for predicting protein complexes is applied to large scale human PPI graphs, and 5737 new protein complexes were predicted. By further analyzing the experimental results in the human interactome, we observed that 72.58% of them are enriched for at least one gene ontology (GO) [23] function term. This finding is an indicator of the quality of the prediction and suggests that the proposed method's outcomes should be further analyzed and validated experimentally to contribute to enlarging our perspective over the human interactome.

The rest of the paper is organized as follows: In Section 2 the proposed methods and the examined datasets are described in detail. In Section 3 the experimental results are provided and discussed. Finally, in Section 4 useful conclusions are drawn and some possible future directions are proposed.

2 Material studied, methods, techniques

2.1 Datasets


In the present paper three different PPI datasets were used to build the PPI graphs which are used as inputs for the protein complex prediction methods. The examined datasets are all weighted datasets where the value of a confidence score is assigned to each protein pair.


The first dataset is a PPI dataset from the well-studied yeast organism (Saccharomyces cerevisiae). It was published by Friedel et al. [24] and contains information about 5195 proteins and 62876 interactions. All these interactions are assigned a confidence score whose values range from 0 to 1. These confidence scores were calculated by combining the purification experimental results and reflect the likelihood of each protein pair to be a protein interaction and the frequency of each protein pair to occur as a protein interaction. The majority of yeast PPIs have been discovered with experimental techniques, and for this reason the yeast dataset is very reliable for the comparison of protein complex prediction techniques. Moreover, a plethora of high confidence protein complex datasets have been published for the yeast organism and some of them are presented in Section 2.3.


The second dataset is a PPI dataset for the human organism. It consists of the protein interactions included in the HPRD database [25]. These protein interactions were filtered using the method proposed in [26] and a confidence score was assigned to each protein pair using the same methodology. This score is a non-linear combination of GO [23] annotations, gene expression profile similarity, colocalization of proteins and presence of interacting domains. Using this scoring scheme we managed to incorporate functional and structural information in the extracted PPI graph. The retrieved confidence scores initially ranged from 0.075 to 2 and they were normalized to take values from 0 to 1. The extracted PPI graph consists of 7450 proteins and 21,475 interactions.

The third dataset is a human PPI dataset downloaded from the HINT-KB [26]. This dataset contains 206,550 high scored computationally predicted protein interactions between 20,845 proteins. For every protein interaction, a confidence score was calculated using the same methodology as for the second dataset.

2.2 Evolutionary enhanced Markov clustering

EE-MC is a combination of an adaptive GA and a variation of the EMC method. The filtering methods of the EMC algorithm were adjusted to enable handling weighted PPI graphs. These filtering methods are density filter, haircut, best neighbor and cutting edge operators. In the following, these modified methodologies are presented. The weighted density of a cluster is estimated using equation (1):

$$\mathrm{density}_w(G') = \frac{2 \times \sum_{e \in E'} w(e)}{|V'| \times (|V'| - 1)} \tag{1}$$

where the predicted cluster is the subgraph G′ = (V′, E′), with V′ and E′ being the sets of nodes and edges respectively, and w(e) is the weight of each edge e of the set E′.


In the density filtering method the weighted density is calculated for every predicted cluster. If its value is less than a predefined threshold then the cluster is discarded. This method aims to keep only highly interconnected clusters.
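As an illustration of equation (1) and of the density filter, the following sketch computes the weighted density of a cluster and keeps only the clusters that reach a threshold. The data layout (a dictionary mapping each undirected edge, stored once as a node pair, to its weight), the function names and the example values are assumptions of this sketch and not part of the original implementation.

```python
def weighted_density(nodes, edges):
    """Equation (1): 2 * (sum of intra-cluster edge weights) / (|V'| * (|V'| - 1)).

    nodes: set of node identifiers forming the cluster.
    edges: dict mapping (u, v) pairs to weights, each undirected edge stored once.
    """
    n = len(nodes)
    if n < 2:
        return 0.0  # a single node has no possible edges
    total = sum(w for (u, v), w in edges.items() if u in nodes and v in nodes)
    return 2.0 * total / (n * (n - 1))


def density_filter(clusters, edges, threshold):
    """Discard clusters whose weighted density is below the threshold."""
    return [c for c in clusters if weighted_density(c, edges) >= threshold]


# Hypothetical example: a dense triangle and a sparse triple.
example_edges = {("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.7,
                 ("D", "E"): 0.3}
example_clusters = [{"A", "B", "C"}, {"D", "E", "F"}]
kept = density_filter(example_clusters, example_edges, threshold=0.5)
```

For the triangle, the summed weight is 2.4 and the density is 2 × 2.4 / (3 × 2) = 0.8; for the sparse triple it is 2 × 0.3 / 6 = 0.1, so only the triangle survives a threshold of 0.5.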

The haircut method starts with calculating the weighted connectivity of each node which belongs to a cluster. This is calculated using equation (2):

$$\deg_w(v) = \sum_{e = (u,v) \in E'} w(e) \tag{2}$$

where u and v are nodes connected with the edge e in the set E′.

Then for each cluster the mean connectivity (MC) is estimated by computing the average connectivity of the nodes belonging to it. If inequality (3) holds for a specific node of the cluster, then this node is deleted from the cluster:

$$\deg_w(v) < \mathrm{haircut\_parameter} \cdot MC \tag{3}$$

where haircut_parameter is a user defined threshold.
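A minimal sketch of the haircut step, following equations (2) and (3), is given below. The adjacency layout (a dictionary of dictionaries holding edge weights), the function names and the choice to apply the rule in a single pass are assumptions of this sketch; the text does not state whether the operator is repeated after nodes are removed.

```python
def weighted_degree(v, cluster, adj):
    """Equation (2): sum of the weights of the edges from v to cluster members."""
    return sum(w for u, w in adj[v].items() if u in cluster)


def haircut(cluster, adj, haircut_parameter):
    """Remove nodes whose weighted degree is below haircut_parameter * MC."""
    if not cluster:
        return set()
    degrees = {v: weighted_degree(v, cluster, adj) for v in cluster}
    mc = sum(degrees.values()) / len(degrees)  # mean connectivity (MC)
    return {v for v in cluster if degrees[v] >= haircut_parameter * mc}


# Hypothetical example: a triangle A-B-C plus a weakly attached node D.
adj_example = {
    "A": {"B": 1.0, "C": 1.0, "D": 0.2},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"A": 1.0, "B": 1.0},
    "D": {"A": 0.2},
}
trimmed = haircut({"A", "B", "C", "D"}, adj_example, haircut_parameter=0.5)
```

Here the weighted degrees are 2.2, 2.0, 2.0 and 0.2, so MC = 1.6 and the cut-off is 0.8; node D falls below it and is removed.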

The best neighbor method is based on checking the nodes of the neighborhood of each cluster produced by the MCL algorithm. For every one of these nodes the value

$$\frac{\sum_{u \in V'} w(u,v)}{\sum_{u \in V} w(u,v)}$$

is calculated, with the numerator being the weighted degree of node v within the predicted cluster G′ = (V′, E′) and the denominator being its overall weighted degree. If for a node this value is greater than a predefined threshold then it is added to the cluster. This method is of crucial importance as it permits clusters to overlap. Moreover, it creates larger clusters which are closer to the real complexes, as proteins in nature participate in more than one protein complex [27].

The cutting edge filter aims at discarding clusters which are not densely connected within themselves and sparsely connected with the rest of the graph. Specifically, it discards clusters when the following inequality holds:

$$\frac{\sum_{u \in V',\, v \notin V'} w(u,v)}{\sum_{e \in E'} w(e)} > \mathrm{threshold} \tag{4}$$

where the numerator is the sum of the weights of the edges between nodes in the predicted cluster (u) and nodes (v) not belonging to it. The denominator is the sum of the weights of the edges of the cluster G′ = (V′, E′). Threshold is a user defined threshold.
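The best neighbor ratio and the left-hand side of inequality (4) can be sketched as follows. The adjacency layout (a dictionary of dictionaries of edge weights), the function names and the example graph are assumptions made for illustration only.

```python
def best_neighbor(cluster, adj, threshold):
    """Add outside neighbours whose in-cluster weight share exceeds the threshold."""
    candidates = {u for v in cluster for u in adj[v] if u not in cluster}
    added = set()
    for v in candidates:
        inside = sum(w for u, w in adj[v].items() if u in cluster)
        overall = sum(adj[v].values())
        if overall > 0 and inside / overall > threshold:
            added.add(v)
    return set(cluster) | added


def cutting_edge_ratio(cluster, adj):
    """Left-hand side of inequality (4): boundary weight / internal weight."""
    boundary = sum(w for u in cluster for v, w in adj[u].items()
                   if v not in cluster)
    # Each internal edge is seen from both endpoints, hence the division by 2.
    internal = sum(w for u in cluster for v, w in adj[u].items()
                   if v in cluster) / 2.0
    return boundary / internal if internal > 0 else float("inf")


# Hypothetical example: triangle A-B-C, D strongly tied to C, E weakly tied to D.
adj_bn = {
    "A": {"B": 1.0, "C": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"A": 1.0, "B": 1.0, "D": 0.9},
    "D": {"C": 0.9, "E": 0.1},
    "E": {"D": 0.1},
}
grown = best_neighbor({"A", "B", "C"}, adj_bn, threshold=0.8)
ratio_before = cutting_edge_ratio({"A", "B", "C"}, adj_bn)
ratio_after = cutting_edge_ratio(grown, adj_bn)
```

Node D has 0.9 of its total weight 1.0 inside the cluster, so it is added. Because the original cluster is not modified in place, the same node could also be added to another cluster, which is how overlaps arise. A cluster would be discarded by the cutting edge filter when its ratio exceeds the chosen threshold.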


In our approach, these methods are applied in the following order: best neighbor, haircut, density filter, and cutting edge. This order is selected to force the first two methods to enhance the clusters predicted by the MCL before applying the two filtering methods.


In the EE-MC method an adaptive GA was used to optimize in parallel the parameters of the EE-MC filtering methods and the inflation rate which is a highly influential parameter in the performance of MCL algorithm.


GAs are general optimization meta-heuristic algorithms. They are based on the initial creation of a random population of candidate solutions, called chromosomes, and their iterative variation through evaluation, selection, crossover and mutation until some termination criteria are reached [28]. GAs have been proven to be useful and efficient in optimization problems where the search space is wide and complicated or where there is not any available mathematical analysis of the problem.


In the present paper, we used GAs for the optimization of the parameters of the EMC filters and MCL's inflation rate. Binary representation was used to represent the candidate solutions as chromosomes because it allows the evolution of small building blocks within the chromosomes and thus speeds up the convergence procedure. Specifically, 10 bits are allocated in the designed chromosome for each parameter which should be optimized. In Figure 1 we provide the representation schema of the EE-MC method along with the range of values that every parameter (gene) could take. It is easily observed that the search space of our problem is very large (2^50 candidate solutions, as five parameters are encoded with 10 bits each) and this fact justifies our initial selection of an evolutionary algorithm to solve it. The size of the initial population of the EE-MC method was set to 20.

To evaluate the chromosomes of the proposed evolutionary framework, a suitable unsupervised fitness function was required to produce scaled fitness and to assign high values to high performance clusterings. To achieve this goal the fitness value of the RNSC algorithm was modified and deployed to produce a novel unsupervised fitness function capable of evaluating weighted graphs. Initially, for each node of a cluster the values γν are calculated using equation (5):

(5)

where wν,u is the weight of the edge between ν and u and Cν is the cluster that includes node ν. This value describes the connectivity of the nodes of a predicted cluster with nodes not belonging to this cluster plus the absence of connections between nodes in the same cluster. Afterwards, the metric described in equation (6) is calculated for every node.

(6)

where Ν(ν) is the number of nodes which are connected with node ν. A term of equation (6) is used to count the nodes which are related to a specific node ν. The related nodes are the ones belonging to the neighborhood of ν plus the ones belonging to the same predicted clusters as ν. Finally, the scaled cost function is calculated using equation (7):

(7)

where n is the total number of nodes in the graph and C is a clustering solution.

As already mentioned, EE-MC may produce overlapping clusters. Thus, a protein node may be assigned to more than one cluster. In this case, the average values of the metrics γν and βν are estimated and used for the calculation of the cost function Cs.

The cost function described in equation (7) is a minimization function and its purpose is to assign lower values to those clustering solutions that produce clusters highly connected within themselves and sparsely connected with the rest of the network. However, for the effective functionality of an evolutionary algorithm the fitness function should be a non-negative maximization function. Thus, the final fitness function, which was utilized in EE-MC, uses the cost function as described in equation (8):

(8)


The selection operator which was deployed in EE-MC was roulette wheel selection, which assigns to each chromosome a probability of selection proportional to its performance. When the fitness values of a GA are well scaled, this selection operator is considered to perform extremely well, and this is the reason we preferred it [29]. To avoid missing good solutions, we have incorporated a simple elitism schema in the selection procedure. Specifically, in every generation (iteration) of the algorithm the worst solution of the population is replaced with the best solution from the previous population. Moreover, the best solution found so far by the algorithm is saved regardless of whether it is selected to participate in the next population or not.
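The selection scheme described above can be sketched as follows. This is a generic illustration of fitness-proportional (roulette wheel) selection with the simple elitism rule, assuming non-negative fitness values; the function names and the toy population are hypothetical and not taken from the EE-MC code.

```python
import random


def roulette_select(population, fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitness)
    pick = rng.uniform(0.0, total)
    running = 0.0
    for individual, f in zip(population, fitness):
        running += f
        if pick < running:
            return individual
    return population[-1]  # guard against floating point round-off


def apply_elitism(population, fitness, previous_best):
    """Replace the worst individual with the best one of the previous generation."""
    worst = min(range(len(population)), key=lambda i: fitness[i])
    new_population = list(population)
    new_population[worst] = previous_best
    return new_population


# Hypothetical usage with string labels standing in for chromosomes.
rng = random.Random(0)
draws = [roulette_select(["a", "b"], [1.0, 3.0], rng) for _ in range(10000)]
share_b = draws.count("b") / len(draws)   # expected to be close to 0.75
with_elite = apply_elitism(["x", "y", "z"], [0.4, 0.1, 0.9], "elite")
```

Keeping the overall best solution in a separate variable, as the text describes, would be a one-line addition in the surrounding generation loop.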


Figure 1. Representation schema of EE-MC
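To make the binary representation concrete, the sketch below decodes a 50-bit chromosome into five real-valued parameters, using 10 bits per gene. The gene order and, in particular, the value ranges are assumptions chosen for illustration; the actual ranges are those given in Figure 1.

```python
BITS_PER_GENE = 10

# Hypothetical (name, low, high) ranges; the real ranges are defined in Figure 1.
GENES = [
    ("inflation_rate", 1.0, 5.0),
    ("best_neighbor_threshold", 0.0, 1.0),
    ("haircut_parameter", 0.0, 1.0),
    ("density_threshold", 0.0, 1.0),
    ("cutting_edge_threshold", 0.0, 1.0),
]

# Number of distinct chromosomes: 2^(10 * 5) = 2^50.
SEARCH_SPACE_SIZE = 2 ** (BITS_PER_GENE * len(GENES))


def decode(chromosome):
    """Map a list of 0/1 bits to a dictionary of real-valued parameters."""
    if len(chromosome) != BITS_PER_GENE * len(GENES):
        raise ValueError("unexpected chromosome length")
    params = {}
    for k, (name, low, high) in enumerate(GENES):
        bits = chromosome[k * BITS_PER_GENE:(k + 1) * BITS_PER_GENE]
        value = int("".join(str(b) for b in bits), 2)  # integer in 0..1023
        params[name] = low + (high - low) * value / (2 ** BITS_PER_GENE - 1)
    return params


all_low = decode([0] * 50)
all_high = decode([1] * 50)
```

With 10 bits per gene each parameter takes one of 1024 evenly spaced values within its range, which fixes the resolution of the search.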


The variation operators used are the two-point crossover and the binary mutation. The two-point crossover is used to create two offspring by combining the chromosomes of two selected parents. The parents are selected at random, as are two crossover points, and two offspring are made by exchanging the genetic material between the two crossover points of the two parents. This procedure is depicted in Figure 2. The usage of two-point crossover enables the exchange of smaller parts of the solution and minimizes the probability of destroying a good solution [30]. Thus, using the two-point crossover operator enables us to search the space of the problem more thoroughly.

Figure 2. Two-point crossover operator
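The two-point crossover depicted in Figure 2 can be sketched as below. This is a generic version of the operator on bit lists; the function name and the way the two cut points are drawn are assumptions of this sketch.

```python
import random


def two_point_crossover(parent_a, parent_b, rng):
    """Exchange the segment between two random cut points of two parents."""
    n = len(parent_a)
    # Two distinct cut positions in 0..n, sorted so that i < j.
    i, j = sorted(rng.sample(range(n + 1), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b


# Hypothetical usage: all-zero and all-one parents make the swapped block visible.
rng = random.Random(1)
child_a, child_b = two_point_crossover([0] * 10, [1] * 10, rng)
```

With these parents, `child_a` carries a single contiguous run of ones (the segment inherited from the second parent) and `child_b` is its bitwise complement.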

The crossover probability was set equal to 0.9 to let some part of the population survive unchanged to the next generation. Most studies on the selection of the optimal mutation rate parameter suggest that a time-variable mutation rate scheme is usually preferable to a fixed mutation rate one [31]. Accordingly, we propose the dynamic control of the mutation parameter using equation (9) to estimate the variation which is applied to the mutation probability in every iteration of the EE-MC:

(9)

where Max_Generation is the maximum number of EE-MC iterations, Pm is the initial value of the mutation rate (it is set to a large value of 0.2) and population_size is the size of the initial population of solutions.

The estimated value of equation (9) is used as follows: For every iteration, the mean similarity among the best solution of the population and the other solutions is calculated. When this value is less than 90% then the mutation probability is reduced by Pm_change; otherwise, the mutation probability is raised by the same quantity. In general, the deployed mutation operator starts with a high mutation probability for the first generations and then gradually decreases the mutation probability over the number of generations. In this manner global search characteristics are adopted in the beginning and are gradually switched to local search characteristics for the final iterations. The mutation rate is reduced with a smaller step when a small population size is used, in order to avoid the stagnation effect. For larger population sizes the mutation rate is reduced with a larger step size since a quicker convergence to the global optimum is expected.
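The update rule described above can be sketched as follows. Since the step size is defined by equation (9), it is passed in here as a plain argument (`pm_change`). The similarity measure (fraction of identical bits), the clamping of the rate to [0, 1] and the function names are assumptions of this sketch.

```python
def mean_similarity(best, others):
    """Average fraction of identical bits between the best solution and the rest."""
    sims = [sum(a == b for a, b in zip(best, o)) / len(best) for o in others]
    return sum(sims) / len(sims)


def update_mutation_rate(pm, pm_change, best, others, similarity_threshold=0.9):
    """Lower the rate while the population is diverse, raise it once it is uniform."""
    if mean_similarity(best, others) < similarity_threshold:
        pm -= pm_change
    else:
        pm += pm_change
    return min(1.0, max(0.0, pm))  # keep the probability valid (assumption)


# Hypothetical usage with 4-bit chromosomes and an arbitrary step of 0.01.
best = [1, 1, 1, 1]
diverse = [[1, 1, 1, 1], [0, 0, 0, 0]]   # mean similarity 0.5
uniform = [[1, 1, 1, 1], [1, 1, 1, 1]]   # mean similarity 1.0
pm_diverse = update_mutation_rate(0.2, 0.01, best, diverse)
pm_uniform = update_mutation_rate(0.2, 0.01, best, uniform)
```

Raising the rate when the population has become almost uniform reintroduces variation, which is one way to counter premature convergence.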


The termination criteria used are a combination of the maximum number of generations (this was set to 100) to be reached and a convergence criterion. The convergence criterion is satisfied when the fitness of the best solution of the population is less than 1% away from the average fitness of the population.
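A minimal sketch of this combined stopping rule is shown below. The text does not specify how the 1% distance is normalized; measuring it relative to the best fitness is an assumption of this sketch, as are the function and argument names.

```python
def should_stop(generation, fitness_values, max_generations=100, tolerance=0.01):
    """Stop at the generation limit or when best and average fitness nearly coincide."""
    best = max(fitness_values)
    average = sum(fitness_values) / len(fitness_values)
    # Relative gap between best and average fitness (normalization is assumed).
    converged = best > 0 and (best - average) / best < tolerance
    return generation >= max_generations or converged


stop_by_limit = should_stop(100, [1.0, 2.0, 3.0])
stop_by_convergence = should_stop(5, [1.0, 0.995, 0.999])
keep_running = should_stop(5, [1.0, 0.5])
```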


The overall flowchart of the EE-MC methodology is presented in Figure 3.

Figure 3. EE-MC's flowchart

2.3 Evaluation Datasets

To evaluate the proposed methodology and compare it with other established methods we have used a series of benchmark datasets. For all datasets, protein complexes with fewer than two proteins are considered as poor sets of molecular interconnecting components and are discarded.

For the yeast organism, three well studied protein complex datasets were used. The first of them is the one proposed and described in [24], which is named BT_409 and consists of 409 protein complexes. The second dataset is named Aloy [32] and it contains 101 protein complexes derived using structure based protein matching with known structures and screened with the electron microscopy method. The third dataset is the Pu dataset [33] and it contains 408 protein complexes.


For the human organism, the evaluation of the outcome of computational methods is much more difficult due to the limited existing knowledge about human PPIs and protein complexes. However, in this paper we made an effort to gain an insight into the quality of the clustering results by building three different protein complex datasets by filtering the protein complexes which are published in CORUM [34]. The first human dataset (443 protein complexes) was created by filtering out all protein complexes which include a protein that is not present in interactions annotated in the HPRD database. The second human evaluation dataset (1097 protein complexes) consists of protein complexes in CORUM when filtering out all complexes with more than half of their proteins not included in the HPRD PPIs. These datasets were used to evaluate the performance of the proposed methodology and the state-of-the-art algorithms when applied to the second PPI dataset (HPRD PPI graph). The third dataset is the full CORUM protein complex dataset of the human organism, containing 1234 protein complexes.

2.4 Evaluation Metrics


For the sake of evaluating the clustering results of the EE-MC method versus the other methods, we initially deployed a set of three different statistical metrics which are called sensitivity, positive predictive value (PPV) and geometric accuracy.


When comparing a clustering outcome with a benchmark protein complex dataset, let n be the number of benchmark complexes and m the number of predicted ones, and let T_ij represent the number of shared proteins between the i-th benchmark complex and the j-th predicted one. The formulas for the examined statistical metrics are described in equations (10), (11) and (12).

$$\mathrm{Sensitivity} = \frac{\sum_{i=1}^{n} \max_{j} \{T_{ij}\}}{\sum_{i=1}^{n} N_i} \tag{10}$$

$$\mathrm{PPV} = \frac{\sum_{j=1}^{m} \max_{i} \{T_{ij}\}}{\sum_{j=1}^{m} T_j} \tag{11}$$

$$\mathrm{Geometric\ Accuracy} = \sqrt{\mathrm{Sensitivity} \cdot \mathrm{PPV}} \tag{12}$$

where N_i denotes the number of proteins belonging to the i-th benchmark complex and T_j denotes the total number of members of the j-th predicted cluster assigned to all benchmark complexes.
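Equations (10) to (12) can be illustrated with the following sketch, which represents complexes as Python sets of protein identifiers. The function names and the toy complexes are assumptions made for illustration.

```python
import math


def overlap_matrix(benchmark, predicted):
    """T[i][j]: number of proteins shared by benchmark complex i and predicted cluster j."""
    return [[len(b & p) for p in predicted] for b in benchmark]


def sensitivity(benchmark, predicted):
    """Equation (10)."""
    t = overlap_matrix(benchmark, predicted)
    return sum(max(row) for row in t) / sum(len(b) for b in benchmark)


def ppv(benchmark, predicted):
    """Equation (11); T_j is the column sum of T over all benchmark complexes."""
    t = overlap_matrix(benchmark, predicted)
    n, m = len(benchmark), len(predicted)
    numerator = sum(max(t[i][j] for i in range(n)) for j in range(m))
    denominator = sum(t[i][j] for i in range(n) for j in range(m))
    return numerator / denominator if denominator else 0.0


def geometric_accuracy(benchmark, predicted):
    """Equation (12): geometric mean of sensitivity and PPV."""
    return math.sqrt(sensitivity(benchmark, predicted) * ppv(benchmark, predicted))


# Hypothetical toy example.
benchmark = [{"A", "B", "C"}, {"D", "E"}]
predicted = [{"A", "B"}, {"C", "D", "E"}]
sn = sensitivity(benchmark, predicted)
pv = ppv(benchmark, predicted)
acc = geometric_accuracy(benchmark, predicted)
```

For this toy example T = [[2, 1], [0, 2]], so the sensitivity is (2 + 2) / (3 + 2) = 0.8, the PPV is (2 + 2) / 5 = 0.8 and the geometric accuracy is 0.8.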


The metrics of sensitivity and PPV are competing ones. When the clustering outcome contains large clusters, the sensitivity score is high whereas the PPV score is low. By contrast, when the clustering outcome contains small clusters, the sensitivity score is low and the PPV score is high. The geometric accuracy aims at balancing the tradeoff between the two metrics of sensitivity and PPV by incorporating both of them in a single geometric mean. However, the geometric accuracy metric faces a strong limitation. As already mentioned, proteins participate in more than one protein complex. Thus, many protein complexes share proteins and some of them are extremely similar to each other. As a consequence, the aforementioned metrics offer an overestimated evaluation of the clustering results.


To overcome this hurdle we have also used the separation metric [35] which takes into consideration the fact that clustering predictions with fewer known complexes should be regarded as the ones with the higher quality.


3 Results


The EE-MC method was initially applied to the yeast weighted PPI graph described in section 2.1. The comparative results conducted on the yeast datasets are considered the most valuable ones as the yeast organism is one of the most thoroughly studied organisms in terms of PPIs and protein complexes.


Because of the stochastic nature of EE-MC we ran it 20 times and calculated the mean values of its evaluation metrics. The performance of EE-MC was compared with the performance of two of the most well-established and frequently used methods for predicting protein complexes from PPI graphs: MCL and RNSC. Their results were calculated using the Superclusteroid Tool [36], which uses for their parameters the optimized values as described in [35]. In Figure 4 we present the evaluation metrics of the aforementioned algorithms when evaluating their outcome with the yeast benchmark datasets described in Section 2.3.


Figure 4: Comparative results (yeast organism)


The best run of the EE-MC algorithm extracted 519 protein complexes and provided the following values for the parameters which we try to optimize: inflation rate: 2.770, best neighbor threshold: 0.8096, haircut parameter: 0.6533, density threshold: 0.1680 and cutting edge threshold: 0.1035. To further validate the performance of the EE-MC method we conducted experiments using a subgraph of the human PPI graph containing proteins and protein interactions limited to a filtered edition of the HPRD interactions (as described in Section 2.1). We focused on this part of the PPI graph as it is the best studied part of it and it could be used to extract meaningful comparative results. In Figure 5 we present the experimental results for this dataset using as benchmark protein complex datasets the ones described in Section 2.3.

Figure 5: Comparative results for the human organism

The best run of EE-MC on the HPRD dataset predicted 1756 protein complexes and provided the following values for the optimized parameters: inflation rate: 1.8426, best neighbor threshold: 0.9883, haircut parameter: 0.0557, density threshold: 0.0459 and cutting edge threshold: 0.0723.

Having shown that the EE-MC algorithm can predict protein complexes in datasets for which high-confidence benchmark protein complex datasets exist, we applied it to predict protein complexes in the full human PPI graph, which is derived using a computational method (as described in section 2.4). The evaluation dataset for this graph is very small, as it contains only 1234 protein complexes for more than 200,000 PPIs. Moreover, experimental comparisons were not applicable, as existing implementations (including Superclusteroid) fail to handle such an enormous network. The EE-MC implementation, however, was able to analyze this PPI graph.

In Table 1, the experimental results of EE-MC are presented for the prediction of protein complexes from the whole human PPI graph. In addition to the performance metrics used in the previous experiments, we computed the percentage of predicted clusters that were characterized as functionally enriched, using the hypergeometric distribution [37] and the functional annotation of the GO database. The GO molecular function terms were filtered to discard generic terms, such as DNA binding and RNA binding, which characterize a large number of proteins. This procedure yielded a set of 3432 specific GO functional terms. Although this metric could be characterized as biased, because functional information was incorporated in the PPI graph through its weights, it was used to validate that the EE-MC method is able to extract the knowledge that lies within a PPI network and its weights. EE-MC predicted 5737 protein complexes, and 72.58% of them were found to be functionally enriched with at least one specific GO functional term.
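The enrichment criterion can be sketched as an upper-tail hypergeometric test. The significance threshold (alpha = 0.05), the absence of a multiple-testing correction and all identifiers below are assumptions made for illustration; the text above does not specify them.

```python
from math import comb


def enrichment_pvalue(population, annotated, cluster_size, hits):
    """P(X >= hits) when cluster_size proteins are drawn without replacement
    from a population in which `annotated` proteins carry the term."""
    upper = min(annotated, cluster_size)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(population, cluster_size)


def cluster_is_enriched(cluster, term_to_proteins, population, alpha=0.05):
    """True if at least one term is over-represented in the cluster."""
    members = set(cluster)
    for proteins in term_to_proteins.values():
        hits = len(members & proteins)
        if hits and enrichment_pvalue(
                population, len(proteins), len(members), hits) < alpha:
            return True
    return False


# Hypothetical annotation: one term shared by three of ten proteins.
annotation = {"term_A": {"p1", "p2", "p3"}}
coherent = cluster_is_enriched({"p1", "p2", "p3"}, annotation, population=10)
mixed = cluster_is_enriched({"p1", "p8", "p9"}, annotation, population=10)
```

The percentage of functionally enriched clusters is then the share of predicted clusters for which such a check returns true. Here the first toy cluster passes (p = 1/120), while the second does not (p = 85/120).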

Method | Geometric accuracy | Sensitivity | PPV | Separation | Percentage of functionally enriched clusters
EE-MC | 32.3125 | 31.7256 | 32.9103 | 14.8391 | 72.58%

Table 1. Experimental results of EE-MC when applied to the whole computationally predicted human PPI network
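As a quick consistency check on Table 1, and assuming that the geometric accuracy is the geometric mean of sensitivity ($S_n$) and PPV as defined in [35], the reported values agree with each other:

```latex
\[
\mathrm{Acc} = \sqrt{S_n \cdot \mathrm{PPV}}
             = \sqrt{31.7256 \times 32.9103}
             \approx \sqrt{1044.10}
             \approx 32.31
\]
```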

To further evaluate the performance of the proposed algorithm, we compared it to the methodology deployed in the PCD-q database [38] in order to validate its ability to locate functionally characterized protein complexes. In the present study we used the same filtering procedure as in [38], retaining only GO terms at a depth greater than 5 in the GO hierarchy. This approach ensures that only specific GO terms are considered. The experimental results presented in Table 2 indicate a higher functional enrichment rate in the protein complexes predicted by EE-MC than in the complexes included in the PCD-q database.

Method | Functionally characterized complexes using GO terms with depth greater than 5 in the GO hierarchy
PCD-q | 450/1264 (35.6%)
EE-MC | 2479/5737 (43.21%)

Table 2. Comparative results with PCD-q in functionally characterizing human protein complexes

4 Discussion

The experimental results on predicting protein complexes in the yeast organism (Figure 4) indicate that the proposed methodology did not significantly enhance performance as measured by the classical metrics of geometric accuracy, sensitivity and PPV. Concerning geometric accuracy, the EE-MC methodology does not outperform existing methodologies, and the differences between the examined algorithms are not statistically significant. The parameters used for MCL and RNSC are the ones proposed in [35], which have been optimized for the yeast network. It is therefore notable that our algorithm, using an unsupervised technique, achieved similar results on the classification metrics that served as fitness functions for the supervised optimization of the MCL and RNSC parameters. Moreover, EE-MC significantly improved the separation results. As described in section 2.4, this is the most important metric for evaluating clustering in PPI graphs, as it is the only measure that takes into account the overlap between protein complexes. The improvement in the separation metric of EE-MC over the state-of-the-art algorithms ranged from ~7% to ~39%. Thus, EE-MC provides an unbiased solution for network clustering which significantly outperforms existing solutions when overlapping clustering is required.

From the comparative experimental results on the human datasets (Figure 5), it is observed that the proposed methodology slightly improved the results in both datasets in terms of the examined metrics (geometric accuracy and separation). However, this improvement was not found to be statistically significant when checked using the two-sample t-test at the 5% significance level. This may be attributed to the absence of overlapping clusters in the applied evaluation datasets, which deprived the proposed methodology of one of its most important advantages. A closer examination of the optimized parameters obtained by the best run of EE-MC on the HPRD dataset shows that our unsupervised evolutionary framework was able to adapt to the nature of the PPI graph under analysis. This PPI graph was not highly connected, and the protein complexes that exist in it are not overlapping. Thus, by carefully selecting the parameters of the EMC filters, the proposed method managed to switch from EMC to a method close to the classical MCL algorithm, which seems to be appropriate for analyzing this dataset.

The comparison of the EE-MC approach against the PCD-q method for the functional characterization of human protein complexes confirmed the superiority of the proposed approach with respect to the proportion of functionally enriched predicted complexes.

5 Conclusions

Proteins and their interactions are considered to be extremely significant for understanding cellular mechanisms and the cellular procedures which lead to diseases. Moreover, diseases are most frequently attributed to perturbations within the complex intracellular network [39]. PPIs and PPI graphs are the basic knowledge required to perform protein functional annotation and pathway analysis. Protein complexes are structural complexes which are formed to perform specific cellular functions. For this reason, understanding the intracellular mechanisms requires the accurate prediction and functional characterization of all protein complexes.

In the present paper we have proposed a novel unsupervised algorithm called EE-MC for the computational prediction of protein complexes on PPI graphs. EE-MC is theoretically able to overcome intrinsic limitations of existing methodologies, such as their inability to handle weighted PPI networks, their constraint of assigning every protein to exactly one cluster and the difficulties they face concerning parameter tuning. To validate this theoretical superiority of EE-MC, we have deployed it to predict protein complexes in an experimentally verified PPI graph of the yeast organism and an adequately studied part of the experimentally verified PPI network of the human organism. In the latter, weights were added to the edges using a computational technique which uses functional and structural information. The results of EE-MC were compared with the corresponding results of the well-established methodologies MCL and RNSC using various benchmark protein complex sets. As indicated by the experimental results, EE-MC clearly outperformed existing methodologies in the separation metric on the yeast dataset and marginally improved on both the geometric accuracy and separation metrics of MCL and RNSC on the human dataset.

After demonstrating the efficiency of the EE-MC algorithm, we applied it to a large human PPI network which was built using a computational method for the prediction and scoring of human PPIs. To the best of our knowledge, only a few studies so far have focused on predicting human protein complexes, and most of them used small-scale PPI networks. In this study we have suggested that a fully unsupervised method applied to a large-scale PPI network is able to predict protein complexes of significant importance. Although the lack of sufficient knowledge about human protein complexes prevents us from effectively evaluating the predicted protein complexes, an examination of their functional annotations showed that 72.58% of them were functionally enriched with at least one GO molecular function term. Moreover, the ability of the proposed methodology to extract functionally enriched protein complexes was compared with the methodology followed in [38]. The experimental results indicated that EE-MC significantly outperformed the method used in the PCD-q database with respect to the proportion of functionally enriched predicted complexes. These findings encourage us to further study these potentially true protein complexes, which should be experimentally validated.

An interesting future direction would be to use the clustering results of EE-MC to functionally characterize proteins of unknown function. Furthermore, our future plans include the incorporation of the clustering results of EE-MC in a recently developed knowledge base, HINT-KB [26], to make them easily accessible to the scientific community. Another interesting idea for future research is the application of the proposed evolutionary framework to optimize the clustering results of RNSC. However, this solution would incur an extremely high computational cost, as both RNSC and the proposed evolutionary framework are complex meta-heuristic approaches. Nevertheless, it would be feasible using high-performance computing architectures such as cloud, grid and clusters. Finally, as new experimental data for human protein complexes emerge at an increasing rate, we should be able in the near future to validate our predictions using more reliable protein complex evaluation datasets.

Acknowledgments. This research has been co-financed by the European Union (European Social Fund - ESF) and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic Reference Framework (NSRF) - Research Funding Program: Heracleitus II. Investing in knowledge society through the European Social Fund.

References

[1] O. Puig, F. Caspary, G. Rigaut, B. Rutz, E. Bouveret, E. Bragado-Nilsson, et al., The Tandem Affinity Purification (TAP) Method: A General Procedure of Protein Complex Purification, Methods 24 (2001) 218-229.
[2] D. Auerbach, S. Thaminy, M. Hottiger and I. Stagljar, The post-genomic era of interactive proteomics: facts and perspectives, Proteomics 2 (2002) 611-623.
[3] A. Kumar and M. Snyder, Protein complexes take the bait, Nature 415 (2002) 123-124.
[4] G. MacBeath and S. Schreiber, Printing proteins as microarrays for high-throughput function determination, Science (New York, N.Y.) 289 (2000) 1760-1763.
[5] K. Theofilatos, C. Dimitrakopoulos, A. Tsakalidis, S. Likothanassis, S. Papadimitriou, S. Mavroudi, Computational Approaches for the Predictions of Protein-Protein Interactions: A Survey, Current Bioinformatics 6(4) (2011) 398-414.
[6] J.D. Han, N. Bertin, T. Kao, D.S. Goldberg, G.F. Berriz, L.V. Zhang, et al., Evidence for dynamically organized modularity in the yeast protein-protein interaction network, Nature 430(6995) (2004) 88-93.
[7] G. Liu, L. Wong, H.N. Chua, Complex discovery from weighted PPI networks, Bioinformatics 25(15) (2009) 1891-1897.
[8] Y. Qi, F. Balem, C. Faloutsos, J. Klein-Seetharman and Z. Bar-Joseph, Protein complex identification by supervised graph local clustering, Bioinformatics 24(13) (2008) i250-i268.
[9] T. Ideker, O. Ozier, B. Schwikowski and A.F. Siegel, Discovering regulatory and signalling circuits in molecular interaction networks, Bioinformatics 18 (Suppl. 1) (2002) S233-S240.
[10] A.J. Enright, S. Van Dongen, C.A. Ouzounis, An efficient algorithm for large-scale detection of protein families, Nucleic Acids Research 30(7) (2002) 1575-1584.

[11] A.D. King, N. Przulj and I. Jurisica, Protein complex prediction via cost-based clustering, Bioinformatics 20(17) (2004) 3013-3020.
[12] G.D. Bader and C.W. Hogue, An automated method for finding molecular complexes in large protein interaction networks, BMC Bioinformatics 4 (2003) 2.
[13] X.L. Li, S.H. Tan, C.S. Foo and S.K. Ng, Interaction Graph Mining for Protein Complexes Using Local Clique Merging, Genome Informatics 16(2) (2005) 260-269.
[14] J. Hodgkin, Seven types of pleiotropy, Int. J. Dev. Biol. 42 (1998) 501-505.
[15] S. Kühner, V. Noort, M. Betts, A. Leo-Macias, C. Batisse, M. Rode, et al., Proteome organization in a genome-reduced bacterium, Science 326 (2010) 1235-1240.
[16] Z. Xie, C. Kwoh, X. Li and M. Wu, Construction of co-complex score matrix for protein complex prediction from AP-MS data, Bioinformatics 27 (2011) i159-i166.
[17] J. Fan, J. Chen and S. Sze, Identifying Complexes from Protein Interaction Networks According to Different Types of Neighborhood Density, Journal of Computational Biology 19(12) (2012) 1284-1294.
[18] M. Li, X. Wu, J. Wang and Y. Pan, Towards the identification of protein complexes and functional modules by integrating PPI network and gene expression data, BMC Bioinformatics 13(109) (2012).
[19] X. Lei, S. Wu, L. Ge and A. Zhang, Clustering and overlapping modules detection in PPI network based on IBFO, Proteomics 13 (2013) 278-290.
[20] E. Becker, B. Robisson, C. Chapple, A. Guenoche and C. Brun, Multifunctional proteins revealed by overlapping clustering in protein interaction network, Bioinformatics 28(1) (2012) 84-90.
[21] C.N. Moschopoulos, G.A. Pavlopoulos, S.D. Likothanassis and S. Kossida, An enhanced Markov clustering method for detecting protein complexes, In Proceedings of the 8th IEEE International Conference on Bioinformatics and BioEngineering, Athens (BIBE2008), pp. 1-6, publisher: IEEE, doi: 10.1109/BIBE.2008.4696656 (2008).
[22] C. Moschopoulos, M. Fytros, S. Alatsathianos, S. Likothanassis and S. Kossida, GAPPI: Identifying Important Protein Modules Through Protein-Protein Interaction Graphs, International Journal on Artificial Intelligence Tools 21(6) (2012) 1250027-1250045.
[23] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium, Nature Genetics 25 (2000) 25-29 (Accessed: 1 December 2012).
[24] C. Friedel, J. Krumsiek and R. Zimmer, Bootstrapping the Interactome: Unsupervised Identification of Protein Complexes in Yeast, Journal of Computational Biology 16(8) (2009) 971-987.
[25] T.S. Keshava Prasad, R. Goel, K. Kandasamy, S. Keerthikumar, S. Kumar, S. Mathivanan, et al., Human Protein Reference Database—2009 update, Nucleic Acids Research 37 (2009) D767-D772 (Accessed: 1 December 2012).

[26] K. Theofilatos, C. Dimitrakopoulos, S. Likothanassis, D. Kleftogiannis, C. Moschopoulos, C. Alexakos, et al., The Human Interactome Knowledge Base (HINT-KB): an integrative human protein interaction database enriched with predicted protein-protein interaction scores using a novel hybrid technique, Artificial Intelligence Review (2013) 1-17, doi: 10.1007/s10462-013-9409-8 (Accessed: 1 December 2012).
[27] V. Spirin and L. Mirny, Protein complexes and functional modules in molecular networks, PNAS 100(21) (2003) 12123-12128.
[28] J. Holland, Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence (Cambridge, Mass.: MIT Press, 1995).
[29] F. Neumann, P. Oliveto and C. Witt, Theoretical Analysis of Fitness-Proportional Selection: Landscapes and Efficiency, In Proceedings of GECCO’09, July 8-12, 2009, Montreal, Quebec, Canada, publisher: ACM.
[30] S. Picek, M. Golub, Comparison of a Crossover Operator in Binary-coded Genetic Algorithms, WSEAS Transactions on Computers 9(9) (2010) 1064-1073.
[31] D. Thierens, Adaptive Mutation Rate Control Schemes in Genetic Algorithms, In Proceedings of the 2002 IEEE World Congress on Computational Intelligence: Congress on Evolutionary Computation 1 (2002) 980-985, doi: 10.1109/CEC.2002.1007058, publisher: IEEE.
[32] P. Aloy, B. Bottcher, H. Ceulemans, C. Leutwein, C. Mellwig, S. Fischer, et al., Structure-based assembly of protein complexes in yeast, Science 303 (2004) 2026-2029.
[33] S. Pu, J. Vlasblom, A. Emili, J. Greenblatt and S. Wodak, Identifying functional modules in the physical interactome of Saccharomyces cerevisiae, Proteomics 7 (2007) 944-960.
[34] A. Ruepp, B. Waegele, M. Lechner, B. Brauner, I. Dunger-Kaltenbach, G. Fogo, et al., CORUM: the comprehensive resource of mammalian protein complexes—2009, Nucleic Acids Research 38 (database issue) (2009) D497-D501 (Accessed: 1 December 2012).
[35] S. Brohee, J. van Helden, Evaluation of clustering algorithms for protein-protein interaction networks, BMC Bioinformatics 7 (2006) 488.
[36] A. Ropodi, N. Sakkos, C. Moschopoulos, G. Magklaras, S. Kossida, Superclusteroid: a Web tool dedicated to data processing of protein-protein interaction networks, EMBnet journal 17(2) (2011) 10-15.
[37] I. Rivals, L. Personnaz, L. Taing and M.C. Potier, Enrichment or depletion of a GO category within a class of genes: which test?, Bioinformatics 23 (2007) 401-407.
[38] S. Kikugawa, K. Nishikata, K. Murakami, Y. Sato, M. Suzuki, M. Altaf-Ul-Amin, et al., PCDq: human protein complex database with quality index which summarizes different levels of evidences of protein complexes predicted from H-Invitational protein-protein interactions integrative dataset, BMC Systems Biology 6 (Suppl 2) (2012) S7.

[39] A. Barabasi, N. Gulbahce and J. Loscalzo, Network medicine: a network-based approach to human disease, Nature Reviews Genetics 12 (2011) 56-68.
