YGENO-08679; No. of pages: 8; 4C: Genomics xxx (2014) xxx–xxx

Contents lists available at ScienceDirect

Genomics journal homepage: www.elsevier.com/locate/ygeno


Article info

Article history:
Received 19 April 2014
Accepted 2 October 2014
Available online xxxx

Keywords:
Protein–protein interaction
Machine learning
Classifier fusion
Ensemble learning
Cellular localization

a Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
b Department of Mathematics, K.N. Toosi University of Technology, Tehran, Iran
c School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran
d Brain and Intelligent Systems Research Lab, Department of Electrical and Computer Engineering, Shahid Rajaee Teacher Training University, Tehran, Iran
e Department for Bioinformatics and Computational Biology, Faculty of Informatics, TUM, Garching 85748, Germany

Abstract

Protein–protein interaction (PPI) detection is one of the central goals of functional genomics and systems biology. Knowledge about the nature of PPIs can help fill the widening gap between sequence information and functional annotations. Although experimental methods have produced valuable PPI data, they also suffer from significant limitations. Computational PPI prediction methods have therefore attracted tremendous attention. Despite considerable efforts, PPI prediction is still in its infancy in complex multicellular organisms such as humans. Here, we propose a novel ensemble learning method, LocFuse, for human PPI prediction. This method uses eight different genomic and proteomic features along with four different types of classifiers. The prediction performance of this classifier selection method was found to be considerably better than that of methods employed hitherto. This confirms the complex nature of the PPI prediction problem and also the necessity of using biological information for classifier fusion. LocFuse is available at: http://lbb.ut.ac.ir/Download/LBBsoft/LocFuse.

Biological significance: The results revealed that if we divide the proteome space according to the cellular localization of proteins, then the utility of some classifiers in PPI prediction can be improved. Therefore, to predict the interaction for any given protein pair, we can select the most accurate classifier with regard to the cellular localization information. The results indicate that the importance of different features for PPI prediction varies between differently localized proteins; in general, however, our novel features, which were extracted from position-specific scoring matrices (PSSMs), are the most important ones, and the Random Forest (RF) classifier performs best in most cases. LocFuse was developed with a user-friendly graphical interface and is freely available for Linux, Mac OS X and MS Windows operating systems.

© 2014 Published by Elsevier Inc.

LocFuse: Human protein–protein interaction prediction via classifier fusion using protein localization information

Javad Zahiri a,b, Morteza Mohammad-Noori c, Reza Ebrahimpour d, Samaneh Saadat a, Joseph H. Bozorgmehr a, Tatyana Goldberg e, Ali Masoudi-Nejad a,⁎

1. Introduction

Proteins carry out nearly all cellular functions through their interactions with other proteins [1,2]. Knowledge about protein–protein interactions (PPI) can provide significant insights into the underlying mechanisms of biological processes and gene functions within a cell. Therefore, PPI detection has become an important topic in functional genomics and systems biology [3–9], and PPI information can be used to provide functional annotation [10]. The two main technologies for investigating protein–protein interactions are yeast two-hybrid (Y2H) [11] and co-immunoprecipitation (Co-IP) [12]. The premise behind the yeast two-hybrid method is that in most eukaryotic transcription factors, the activating and binding domains are modular and can function in proximity to each other [13]. This method employs a transcription factor, traditionally the yeast transcription factor GAL4, and the two proteins of interest are fused to the activating and binding domains of the transcription factor. If the two proteins interact with each other, the transcription of the reporter gene will be induced and the interaction can be detected. Co-IP is used to identify interactions between two proteins or entire protein complexes in vitro. In this method the protein of interest is targeted with an antibody. By targeting the protein it may become possible to pull the entire protein complex out of solution. This protein complex can then be analyzed to identify new interacting partners of the protein of interest. Recent advances in high-throughput technologies provide massive protein–protein interaction data for various organisms. Although these experimental approaches produce crucially valuable data, they have noticeable limitations [6,14,15]: these approaches are

⁎ Corresponding author. Fax: +98 21 6640 4680. E-mail address: [email protected] (A. Masoudi-Nejad). URL: http://LBB.ut.ac.ir (A. Masoudi-Nejad).

http://dx.doi.org/10.1016/j.ygeno.2014.10.006 0888-7543/© 2014 Published by Elsevier Inc.

Please cite this article as: J. Zahiri, et al., LocFuse: Human protein–protein interaction prediction via classifier fusion using protein localization information, Genomics (2014), http://dx.doi.org/10.1016/j.ygeno.2014.10.006


Each pair of interacting proteins was transformed into eight different feature vectors representing various genomic and proteomic information, briefly described in the following; for formal definitions refer to Sections 1.1.1–1.1.5 of [50].

2.1.1. Post-translational modification (PTM)

Considering the important role of PTMs, we coded each protein pair as a 620-length feature vector, in which every element corresponds to a PTM type on a residue, taken from the HPRD database [51]. There are 31 different PTM types listed in the HPRD database for human proteins, extracted from the literature. We encoded each protein pair P1–P2 in our dataset as a feature vector PTM = (ptm1, ptm2, …, ptm620), where ptmi ∈ {0, 1, 2}. In this vector, ptmi equals 2 if both P1 and P2 undergo the ith PTM, 1 if exactly one of P1 and P2 undergoes the ith PTM, and 0 otherwise.
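The {0, 1, 2} encoding described above can be illustrated with a minimal sketch. The function name `encode_pair`, the representation of each protein as a set of feature indices, and the toy dimensionality of 6 (instead of 620) are assumptions made for illustration only; they are not taken from the LocFuse implementation.

```python
def encode_pair(annotations_p1, annotations_p2, n_features):
    """Return a vector v where v[i] counts how many of the two proteins carry feature i.

    Each protein is given as a set of feature indices, so v[i] is always 0, 1 or 2.
    """
    vector = [0] * n_features
    for annotations in (annotations_p1, annotations_p2):
        for i in annotations:
            vector[i] += 1
    return vector


# Hypothetical toy example with 6 features instead of 620.
p1_ptms = {0, 2, 5}
p2_ptms = {2, 3}
ptm_vector = encode_pair(p1_ptms, p2_ptms, 6)
print(ptm_vector)
```

The tissue information vector of Section 2.1.2 follows the same rule, with tissues in place of PTM types.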

2.1.2. Tissue information (TSU)

For each protein pair, we encoded tissue information in a 582-length feature vector TSU = (tsu1, tsu2, …, tsu582) with tsui ∈ {0, 1, 2}, based on the information in the HPRD annotations. We set tsui = 2 if both proteins are expressed in tissue i, tsui = 1 if only one of the two proteins is expressed in tissue i, and tsui = 0 otherwise.

2.1.3. Codon usage (CDN)

Based on the results of Saunders and Deane [40], we selected 29 codons that significantly influence protein secondary structure and used them as the elements of a feature vector for encoding protein pairs.

2.1.4. PSSM-based features

In this paper we used three types of PSSM-based features to predict PPI. The first two, PSSM8 and PSSM64, are modified versions of features proposed in our recent work [30]. The third type, protein sequence and consensus sequence hybridization (SCSH), is a novel feature that can be extracted from the sequence and evolutionary information contained in the PSSM of each protein. To create a PSSM, we used the Position-Specific Iterated BLAST (PSI-BLAST) tool [52] with three iterations and an E-value < 10^−3 against the NCBI non-redundant (NR) dataset [53]. In total, we used four different PSSM-based features for PPI prediction: PSSM8, PSSM64, SCSH2 and SCSH3 (see Section 1.1.4 and Fig. 2 of [50] for precise definitions).

2.1.5. GO-based similarity (GO)

To compute the GO feature vectors, the biological process and cellular localization annotations for all proteins were extracted from UniProtKB, and the simGIC [54] similarity scores for each protein pair were then computed in these two categories. Finally, each protein pair was coded as a two-dimensional feature vector whose elements correspond to the two GO similarity scores. The ProteInOn tool [55] was used to compute simGIC for protein pairs.
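As an illustration of the similarity score, simGIC is commonly defined as a Jaccard index weighted by information content (IC): the summed IC of the GO terms shared by two proteins divided by the summed IC of all terms annotating either protein, where each term set includes ancestor terms. The sketch below assumes this definition; the term identifiers and IC values are hypothetical, and the paper itself relied on ProteInOn rather than on code like this.

```python
def sim_gic(terms_a, terms_b, ic):
    """simGIC: summed IC of shared terms divided by summed IC of the union of terms.

    terms_a and terms_b are sets of GO terms (assumed to already include ancestors);
    ic maps each term to its information content.
    """
    union = terms_a | terms_b
    denominator = sum(ic[t] for t in union)
    if denominator == 0:
        return 0.0
    shared = terms_a & terms_b
    return sum(ic[t] for t in shared) / denominator


# Hypothetical terms and IC values, for illustration only.
ic = {"GO:A": 0.5, "GO:B": 1.5, "GO:C": 2.0}
score = sim_gic({"GO:A", "GO:B"}, {"GO:A", "GO:C"}, ic)
print(score)  # 0.5 / 4.0
```

Computing this score once for biological process terms and once for cellular localization terms would give the two elements of the GO feature vector.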


2.2. Preprocessing and normalization of feature vectors

A three-step preprocessing was applied to the feature vectors: removal of non-informative features, normalization and feature selection. In the first step, each feature that has an identical value in 99% or more of the instances was considered non-informative and was removed.
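The first step can be sketched as follows. This is a minimal illustration under the assumption that the data are held as a list of rows; the function name `informative_columns` is hypothetical.

```python
from collections import Counter


def informative_columns(rows, threshold=0.99):
    """Return indices of columns whose most frequent value covers less than `threshold` of rows."""
    n = len(rows)
    keep = []
    for j in range(len(rows[0])):
        most_common_count = Counter(row[j] for row in rows).most_common(1)[0][1]
        if most_common_count / n < threshold:
            keep.append(j)
    return keep


# Toy data: column 0 is constant, column 1 is identical in 99.5% of rows,
# columns 2 and 3 vary enough to be kept.
rows = [[0, 1 if i == 0 else 0, i % 2, 1 if i < 10 else 0] for i in range(200)]
print(informative_columns(rows))
```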

2. Materials and methods

The results of recent studies show the complexity of the computational PPI prediction problem in humans [8,27]. Ensemble learning methods have been applied successfully to similarly complex classification problems in many fields [46–48]. An ensemble learning method combines the decisions of an ensemble of experts (base classifiers) in the hope that the combined decision will be better than the individual ones. In this study, we used eight different features to code protein pairs, and in each feature space four base classifiers (RF, NB, MLP and RBF) were trained. To run the base classifiers we used the WEKA software package [49], which is a comprehensive open-source library of data mining and machine learning methods.

2.1. Feature extraction

biased towards well-studied proteins, and so the available interactomes of many organisms are far from complete [16,17] (there are many false negative interactions). Considering these limitations, computational PPI prediction methods have attracted tremendous attention among researchers from the biology and bioinformatics communities. In recent years, many computational methods for PPI detection have been proposed; such methods can be considered a computational complement to the high-throughput experimental methods for extracting true protein interactions [18,19]. More than 100 PPI-related repositories have been published and are available online [8,20], and these databases can be used as the major data source for evaluating prediction methods. There are several computational approaches for PPI prediction that use different genomic and proteomic information, such as structural information [21,22]; gene neighboring [23]; phylogenetic relationship [24]; network topology [25]; literature mining [26] and heterogeneous genomic/proteomic features utilizing machine learning algorithms [4,27–30].

In this study, we propose a novel classifier fusion method to predict human PPIs. We exploit different feature types to encode protein pairs:

1. Post-translational modification: after translation, various post-translational modifications (PTMs) fine-tune the functions of almost all eukaryotic proteins [31]. Therefore, PTMs are crucial for the control of several important cellular processes [32] and affect PPIs [33,34].
2. Tissue information: depending on the tissue in which it is expressed, a protein may have different functions [35] and interacting partners [36]. As a result, different tissues may have different active PPI networks [35,37,38].
3. Codon usage: synonymous codon usage is correlated with gene function in higher organisms such as human [39]. Many synonymous codons vary in their propensity for protein secondary structures and thus may influence the protein's structure [40,41].
4. Evolutionary information (four feature types): four evolutionary features extracted from position-specific scoring matrices (PSSMs).
5. Similarity of Gene Ontology (GO) annotations: similarity in GO annotations [42] is a strong indicator of PPI [43,44]. In this study, we used the semantic similarity between GO annotations to encode each protein pair.

We train four different classifiers, random forest (RF), Naïve Bayes (NB), multi-layer perceptron (MLP) and radial basis function network (RBF), in all eight feature spaces (for details about these classifiers refer to [45]). For every cellular compartment to which a protein pair is localized, we determine a classifier that is a specialist for predicting an interaction. We show that the prediction performance of the proposed method is considerably better than that of the current state-of-the-art PPI prediction methods. In addition, the results show that ensemble learning methods popular in the machine learning community, such as weighted voting, decision template, stacking, bagging and AdaBoost (for details about these methods refer to [46]), perform worse than the best single classifier and so cannot work well in PPI prediction; this result shows the complex nature of the problem and the necessity of using biological knowledge for classifier fusion. Finally, the importance of different features for PPI prediction varies between differently localized proteins. Generally, our novel PSSM-based features are the most important ones and the RF classifier is the most accurate classifier in most cases.


J. Zahiri et al. / Genomics xxx (2014) xxx–xxx


In the second step, to remove biases such as the protein length effect, we normalized the feature values as follows:

\[ f_{i,j} = \frac{f_{i,j} - \min}{\max \times L} \qquad (1) \]


2.5. Independent test set

In addition to the SD and RP datasets, we used the CORUM database (released in February 2012) [68], a comprehensive resource of mammalian protein complexes, to construct an independent test set for evaluating LocFuse. There were 1846 human protein complexes in CORUM. All complexes were obtained from individual experiments published in scientific articles; data from high-throughput experiments were excluded. All protein pairs in the same complex were considered as interacting proteins. The number of interacting protein pairs for which all the above-mentioned features are available was 19,090.
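Treating all protein pairs within a complex as interacting amounts to enumerating the unordered pairs of each complex and merging them. The following sketch illustrates this; the function name and the toy complexes are hypothetical, and the filtering for feature availability described above is not shown.

```python
from itertools import combinations


def pairs_from_complexes(complexes):
    """Collect every unordered protein pair that co-occurs in at least one complex."""
    pairs = set()
    for members in complexes:
        # sorted(set(...)) removes duplicate members and fixes the order within a pair
        for a, b in combinations(sorted(set(members)), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical toy complexes; the pair (P2, P3) occurs in both but is counted once.
toy_complexes = [["P1", "P2", "P3"], ["P2", "P3", "P4"]]
positive_pairs = pairs_from_complexes(toy_complexes)
print(len(positive_pairs))
```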


2.6. Performance assessment

Considering the benchmark datasets, the prediction performance can be assessed with different measures based on four basic quantities: TP (true positives), the number of interactions correctly predicted; TN (true negatives), the number of non-interactions correctly predicted; FP (false positives), the number of non-interactions incorrectly predicted as interactions; and FN (false negatives), the number of interactions incorrectly predicted as non-interactions. These were used to calculate precision, recall, F-measure and area under the ROC curve (AUC). Recall is the fraction of real interactions correctly identified by a predictor. Precision is the fraction of positive interaction predictions that are correct. F-measure is the balance between precision and recall, and the accuracy is the proportion of correct


We used four different classifiers, random forest (RF), Naïve Bayes (NB), multi-layer perceptron (MLP) and radial basis function network (RBF), in each of the eight feature spaces mentioned above, which resulted in 32 different base classifiers (Fig. 1). Additionally, we exploited information about the subcellular localization of interacting proteins and found that it improves the classification accuracy significantly. Localization information was extracted from the Human Protein Atlas (HPA) [57]; in the case of missing annotations it was predicted with a recently published localization prediction tool, LocTree2 [58]. Based on localization information, proteins were grouped into five clusters, Loc-Clusts (see Table 2 of [50] for details): non-membrane-non-secretory (NMNS); non-membrane-secretory (NMS); membrane-non-secretory (MNS); membrane-secretory (MS) and unknown (UNK). According to this clustering, a protein pair can fall into one of 15 different clusters (e.g. MS–MS, NMS–MS and MS–MNS), i.e. the 15 unordered pairs that can be formed from five protein clusters. As a result, our final multiple classifier system was composed of an ensemble of 32 × 15 base classifiers for PPI prediction (Fig. 1).

To select the best classifiers for labeling a protein pair as interacting or not in each of the 15 localization clusters, we applied a classifier selection approach (Fig. 2). Based on the best training positive predictive value (PPV or precision) and negative predictive value (NPV) (see Section 1.3 of [50] for definitions of PPV and NPV) of the classifiers in each Loc-Clust, two classifiers were selected as the interaction and non-interaction experts, respectively. We denote the non-interaction and interaction experts by EXP0 and EXP1 and their predicted labels by Y0 and Y1, where Yi ∈ {0, 1}, with 0 denoting non-interaction and 1 interaction. If the predictions of EXP0 and EXP1 do not conflict, i.e. Y0 = Y1, that common label is assigned to the input protein pair. Otherwise, the conflict is resolved by computing the expertness of EXP0 and EXP1 for protein pairs whose GO cellular compartment annotations are similar to those of the input protein pair. First, the k nearest neighbors of the input protein pair, P1 and P2, are determined based on GO cellular compartment annotations; these are the k protein pairs whose annotations are most similar to those of P1 and P2. We denote these k protein pairs by N^GO_K(P1, P2). Then, the confusion matrices of EXP0 and EXP1 in N^GO_K(P1, P2) are computed and, according to Y0 and Y1 and the performance of EXP0 and EXP1, one of the two conflicting opinions is accepted. If Y0 = 0 and Y1 = 1, we use NPV and PPV as the expertness of EXP0 and EXP1, respectively; if Y0 = 1 and Y1 = 0, we use PPV and NPV as the expertness of EXP0 and EXP1 for the input proteins (the best value for k was 200). Similarly, if the input proteins have no localization information and no GO annotation, the expertness of EXP0 and EXP1 is still computed, but the k nearest neighbors of the input proteins are determined in the corresponding feature spaces of EXP0 and EXP1 instead of GO. For example, if the input protein pair belongs to Loc-Clust9 and EXP0 and EXP1 are NB and RF in the PTM and TSU feature spaces, respectively, then the NPV of the NB classifier, which has been trained on the PTM feature space, in the k-nearest neighbors of the input proteins in the PTM feature space is considered as the

For validating the prediction method, we compiled gold standard datasets of interacting and non-interacting protein pairs. Protein interactions were taken from HPRD, a manually curated database of human protein interactions in which every entry is linked to the corresponding experimental evidence. Interactions confirmed by at least two experiments formed our positive dataset (discarding self-interactions): 12,175 unique interactions among 5630 proteins. A more challenging aspect of PPI gold standard construction is the selection of negative PPIs. There are conflicting results in the literature about the correct methodology for negative interaction selection, and it has been shown that some methodologies may lead to overestimating the prediction performance [30,59–61]. Uniform random pair selection has been reported to be less biased [62]. In addition, a negative dataset with the same topology as the positive dataset (the number of occurrences of each protein in the negative dataset is the same as in the positive dataset) produces a dataset that is difficult for computational PPI prediction methods [30,59,61,63]. Therefore, to select confident non-interacting protein pairs (NP), we first selected all protein pairs that have been tested in an experiment, based on the PubMed ids of experiments mentioned in the HPRD database, and not reported as interacting in any of the five most comprehensive human PPI databases: HPRD, BioGRID [64], IntAct [65], MINT [66] and DIP [67]. The NP set contained about 3 million protein pairs. We then used the two above-mentioned negative selection strategies to construct two datasets, RP and SD, for benchmarking LocFuse and other PPI prediction methods. RP: the 12,175 interactions plus 12,175 non-interactions selected randomly from the NP set. SD: the 12,175 interactions plus 12,175 non-interactions selected from the NP set such that each protein had the same number of occurrences as in the positive dataset.
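The RP strategy corresponds to drawing distinct pairs uniformly at random from the NP set, and the SD constraint can be checked by comparing per-protein occurrence counts between the positive and negative sets. The sketch below illustrates both ideas on toy data; the function names, the fixed seed and the toy NP set are assumptions, and the procedure actually used to build SD is not reproduced here.

```python
import random
from collections import Counter


def sample_random_negatives(candidate_pairs, n, seed=0):
    """Draw n distinct non-interacting pairs uniformly at random (RP-style selection)."""
    rng = random.Random(seed)
    # Sorting makes the draw reproducible regardless of set iteration order.
    return rng.sample(sorted(candidate_pairs), n)


def protein_occurrences(pairs):
    """Count how often each protein appears in a collection of pairs."""
    counts = Counter()
    for a, b in pairs:
        counts[a] += 1
        counts[b] += 1
    return counts


# Hypothetical toy NP set; the real one contains about 3 million pairs.
np_set = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("C", "E")}
rp_negatives = sample_random_negatives(np_set, 3)
print(rp_negatives, protein_occurrences(rp_negatives))
```

An SD-style negative set would additionally require `protein_occurrences(negatives) == protein_occurrences(positives)`.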


2.3. Prediction method LocFuse


2.4. Gold standard dataset


where f_{i,j} is the value of the ith feature in the jth protein pair, and min and max denote the minimum and maximum of the feature values [56]; the variable L in the denominator is the protein length (this normalization was applied to the PSSM-based features). In the third step, to deal with the curse of dimensionality, in each D-dimensional feature space with D ≥ 500 the 100 most informative features, ranked by information-gain ratio [45], were selected to represent protein pairs.
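The feature selection of the third step can be sketched as follows. The sketch assumes discrete-valued features and the usual definition of the gain ratio (information gain about the class label divided by the entropy of the feature itself); continuous features would first need to be discretized, which is not shown. All function names are hypothetical and the thresholds are exposed as parameters only so that the toy example can run on two features.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by the feature's own entropy."""
    n = len(labels)
    conditional = 0.0
    for value, count in Counter(feature).items():
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += (count / n) * entropy(subset)
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # a constant feature carries no information
    return (entropy(labels) - conditional) / split_info


def select_features(columns, labels, min_dim=500, keep=100):
    """Keep the `keep` best-ranked columns when there are at least `min_dim` columns."""
    if len(columns) < min_dim:
        return list(range(len(columns)))
    ranked = sorted(range(len(columns)),
                    key=lambda j: gain_ratio(columns[j], labels), reverse=True)
    return sorted(ranked[:keep])


# Toy example: the second feature separates the classes perfectly, the first does not.
labels = [0, 0, 1, 1]
columns = [[0, 1, 0, 1], [0, 0, 1, 1]]
print(select_features(columns, labels, min_dim=2, keep=1))
```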


expertness of EXP0. Analogously, the PPV of the RF classifier, which has been trained on the TSU feature space, in the k-nearest neighbors of the input proteins in the TSU feature space is considered as the expertness of EXP1.
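The conflict resolution between the two experts can be summarized in a short sketch. It assumes that the predictions of each expert on the k nearest neighbors and the true labels of those neighbors are already available, and that the opinion of the expert with the higher expertness is the one accepted; the tie-breaking rule (preferring EXP0) and all function names are assumptions for illustration, not details given in the text.

```python
def confusion(predictions, truths):
    """Return (tp, fp, tn, fn) for binary predictions against true labels."""
    tp = fp = tn = fn = 0
    for p, t in zip(predictions, truths):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def ppv(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def npv(tn, fn):
    return tn / (tn + fn) if tn + fn else 0.0


def resolve(y0, y1, exp0_neighbors, exp1_neighbors):
    """Combine the labels of EXP0 and EXP1.

    exp0_neighbors and exp1_neighbors are (predictions, true_labels) of each
    expert on the k nearest neighbors of the input protein pair.
    """
    if y0 == y1:
        return y0
    tp0, fp0, tn0, fn0 = confusion(*exp0_neighbors)
    tp1, fp1, tn1, fn1 = confusion(*exp1_neighbors)
    if y0 == 0:  # EXP0 says non-interaction, EXP1 says interaction
        expertness0, expertness1 = npv(tn0, fn0), ppv(tp1, fp1)
    else:        # EXP0 says interaction, EXP1 says non-interaction
        expertness0, expertness1 = ppv(tp0, fp0), npv(tn1, fn1)
    # Assumed rule: accept the more expert opinion, preferring EXP0 on ties.
    return y0 if expertness0 >= expertness1 else y1


# Toy neighborhood with four neighbors.
exp0 = ([0, 0, 0, 1], [0, 0, 1, 1])  # NPV = 2/3, PPV = 1
exp1 = ([1, 1, 0, 0], [1, 1, 0, 0])  # PPV = 1, NPV = 1
print(resolve(0, 1, exp0, exp1))
```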


predictions. The ROC curve plots the true positive rate against the false positive rate; the closer the AUC is to one, the more accurate the classification. We used five-fold cross-validation to compute performance estimates. Each whole dataset (RP and SD) was randomly partitioned into five subsets of equal size. To ensure that the training process is completely independent of the test data, the methods were trained on four subsets and tested on the remaining one. The subsets were rotated five times, such that each subset was used for both training and testing, and each protein pair was used for testing exactly once.
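The threshold-based measures and the five-fold partitioning can be sketched as follows. The F-measure is taken here to be the harmonic mean of precision and recall (the usual F1 score), which is an assumption, since the text only describes it as the balance between the two; the function names and the fixed seed are likewise hypothetical.

```python
import random


def classification_metrics(tp, tn, fp, fn):
    """Return (precision, recall, f_measure, accuracy) from the four basic counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy


def k_fold_indices(n, k=5, seed=0):
    """Randomly split indices 0..n-1 into k folds; each index lands in exactly one fold."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


precision, recall, f_measure, accuracy = classification_metrics(tp=80, tn=70, fp=20, fn=30)
folds = k_fold_indices(23)
for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
    # a classifier would be trained on `train` and evaluated on `test_fold` here
```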


Fig. 1. Construction of the ensemble of classifiers. Using the 8 different feature spaces and localization information the final ensemble is constructed.


Fig. 2. a. The workflow of LocFuse. To assign a label to an input protein pair P1 and P2, first the interaction and non-interaction experts, EXP1 and EXP0, are determined based on the cellular localization cluster (Loc-Clust) of those proteins. If the two experts have non-conflicting predictions, then that label is assigned to the input protein pair. Otherwise, the conflict is resolved based on the positive predictive value (PPV) and negative predictive value (NPV) of EXP0 and EXP1 in the k-nearest neighbors of the input protein pair in the GO cellular compartment annotation space (N^GO_K(P1, P2)). b. Expert selection for input proteins (suppose that the protein pair (P1, P2) belongs to Loc-Clust9).


3. Results and discussion

3.1. Localization information for PPI clustering

After training the base classifiers, our analysis showed that some protein pairs can be considered hard instances and others simple instances for different classifiers; a deeper analysis showed that the majority of hard instances for different classifiers can be partitioned into a few subsets based on cellular localization (results not shown). This analysis and other increasing evidence [58,69,70] suggest that localization information can be used for PPI prediction. We used the Human Protein Atlas (HPA) localization database to assign localizations to proteins, excluding localizations with ‘very low’ and ‘non-supportive’ reliability; for proteins without localization information (3053 proteins) we used a recently published localization prediction tool, LocTree2. After using HPA and LocTree2, fewer than 700 proteins, for which LocTree2 was not able to provide a prediction, remained with ‘unknown’ localization. Because LocTree2 has been shown to be an accurate tool for distinguishing between trans-membrane and non-membrane proteins [58], we grouped our proteins based on localization information into five clusters, Loc-Clusts (see Fig. 3.a and Section 1.2 of [50] for more details): non-membrane-non-secretory (NMNS); non-membrane-secretory (NMS); membrane-non-secretory (MNS); membrane-secretory (MS) and unknown (UNK). The majority of all proteins (64%; Fig. 3.a) fall into the single NMNS Loc-Clust, and the majority of interactions (58% of all interactions) also belong to the NMNS Loc-Clust (Fig. 3.b).

3.1.1. PSSM-based features contribute most to PPI prediction

We computed the accuracy of LocFuse for each Loc-Clust to show the difficulty of PPI prediction in different localizations, where lower accuracy means higher difficulty (Fig. 3.c). The results show that PPI prediction for non-membrane proteins is the most difficult (in both the SD and RP datasets). We also selected the feature types with the highest accuracy in PPI prediction in the different Loc-Clusts as the “most important features” (Fig. 3.d). The results show that different features contribute differently to the prediction of PPIs among different Loc-Clusts. In general, however, our novel PSSM-based features, SCSH2 and PSSM8, in combination with the RF classifier perform best in most cases. If we do not consider the localization information, the most important features are PSSM-based features in the SD and RP


Fig. 3. a. Distribution of proteins in our benchmark datasets among different localization clusters (Loc-Clusts): non-membrane-non-secretory (NMNS); non-membrane-secretory (NMS); membrane-non-secretory (MNS); membrane-secretory (MS) and unknown (UNK). Localization clusters were constructed based on annotation information from the Human Protein Atlas (HPA) database and LocTree2 predictions (if HPA annotations were not available). b. Distribution of pairs of interacting proteins among the five localization clusters. c. LocFuse performance (y-axis; accuracy) in predicting PPIs in different Loc-Clusts (x-axis). Both the SD and RP datasets include 12,175 interacting protein pairs and the same number of non-interacting pairs. Interactions were taken from HPRD and were confirmed by at least two experiments. Non-interactions for RP were selected randomly from the confident non-interacting protein pair (NP) set; for SD these were selected from NP such that each protein had the same number of occurrences as in the interactions. d. “Most important features”: feature types and classifiers (given in parentheses) leading to the highest accuracy in PPI prediction for proteins with different localizations. Results are shown only for Loc-Clust pairs with at least 100 members (interacting pairs of proteins). Abbreviations: SCSH is the protein sequence and consensus sequence hybridization feature (Section 1.1.4 of [50]); PSSM8 and PSSM64 are features extracted from position-specific scoring matrices (PSSMs); PTM is the post-translational modification feature; TSU is the tissue information feature; RF is the random forest classifier; NB is the Naïve Bayes classifier; MLP is the multi-layer perceptron classifier and RBF is the radial basis function network classifier.


Table 1 Performance comparison between LocFuse, the best single classifier (BSC), five popular ensemble learning methods (AdaBoost, bagging, weighted voting (WV), decision template (DT) and stacking) and five current state-of-the-art PPI prediction methods (M1: Martin et al. [3]; M2: Shen et al. [4]; M3: Guo et al. [5]; M4: Liu et al. [27] and M5: Zahiri et al. [30]). The methods were evaluated in a 5-fold cross-validation using the RP and SD datasets. Both the SD and RP datasets include 12,175 interacting protein pairs and the same number of non-interacting pairs. Interactions were taken from HPRD and were confirmed by at least two experiments. Non-interactions for RP were selected randomly from the confident non-interacting protein pair (NP) set; for SD these were selected from NP such that each protein had the same number of occurrences as in the interactions. Standard deviations for all results are less than 0.02. In each column the highest result is highlighted in bold.

Dataset  Method    Precision  Recall  F-measure  Accuracy  AUC
RP       BSC       0.75       0.68    0.71       0.72      0.78
RP       AdaBoost  0.71       0.71    0.71       0.71      0.79
RP       Bagging   0.60       0.78    0.68       0.62      0.73
RP       WV        0.69       0.70    0.69       0.71      0.73
RP       DT        0.68       0.71    0.70       0.69      0.74
RP       Stacking  0.72       0.73    0.73       0.71      0.75
RP       M1        0.68       0.60    0.63       0.68      0.72
RP       M2        0.69       0.61    0.65       0.67      0.73
RP       M3        0.64       0.66    0.65       0.65      0.73
RP       M4        0.72       0.62    0.66       0.69      0.70
RP       M5        0.77       0.66    0.70       0.74      0.77
RP       LocFuse   0.81       0.70    0.76       0.77      0.85
SD       BSC       0.65       0.64    0.64       0.64      0.71
SD       AdaBoost  0.55       0.65    0.59       0.56      0.59
SD       Bagging   0.59       0.60    0.59       0.60      0.63
SD       WV        0.65       0.67    0.66       0.65      0.70
SD       DT        0.69       0.63    0.66       0.67      0.71
SD       Stacking  0.72       0.65    0.68       0.69      0.69
SD       M1        0.53       0.52    0.52       0.52      0.55
SD       M2        0.56       0.51    0.53       0.53      0.60
SD       M3        0.58       0.59    0.59       0.61      0.60
SD       M4        0.68       0.45    0.55       0.62      0.67
SD       M5        0.68       0.65    0.66       0.66      0.74
SD       LocFuse   0.72       0.74    0.73       0.72      0.81


3.2. Comparison with other methods

3.2.1. LocFuse outperforms the best single classifier and popular ensemble learning methods

Table 1 shows a comparison between LocFuse and other methods in predicting PPIs for both the RP and SD datasets. For all methods, the predictions were evaluated in a 5-fold cross-validation. In comparison with the best single classifier (BSC), i.e. the best of the 32 base classifiers mentioned above (MLP, NB, RBF and RF classifiers in the eight feature spaces, discarding localization information), LocFuse performed better for both datasets in all performance measures analyzed. We also compared LocFuse with five popular ensemble learning methods: AdaBoost, bagging, weighted voting, decision template and stacking (for descriptions of these methods refer to Kuncheva [46] and Witten et al. [45]). To run AdaBoost and bagging, features from all eight feature spaces were first ranked according to their information-gain ratio and the 200 top-ranked features were then selected. For weighted voting (WV), decision template (DT) and stacking, we used the predictions of the 32 base classifiers on the whole datasets. AdaBoost and bagging were run four times using MLP, NB, RBF and decision tree as base classifiers; stacking was run four times using MLP, NB, RBF and RF as meta classifiers. For weighted voting we also used different weight selection strategies; similarly, various approaches were tested for decision template. Finally, for each of these methods the best result is shown. In addition, the results show that the performances of the popular ensemble learning methods are comparable to the performance of the

3.2.2. LocFuse competitive with state-of-the-art methods

To compare LocFuse with the current state-of-the-art PPI prediction methods, we implemented five different approaches, M1–M5, according to previously published studies: M1: Martin et al. [3]; M2: Shen et al. [4]; M3: Guo et al. [5]; M4: Liu et al. [27] and M5: Zahiri et al. [30]. As described in the papers, we used libsvm [71] for classification with M1–M3; for classification with M4 we used the source code and the features available on the authors' website (http://ifg.stat.sinica.edu.tw/PPI//CD_WWW/CD_PPI.html). The results show that for both the SD and RP datasets LocFuse performed significantly better than any other method (F-measure, accuracy and AUC are 0.76, 0.77 and 0.85 in the RP dataset and 0.73, 0.72 and 0.81 in the SD dataset, respectively), and the best of these other methods is slightly better than our best single classifier.

386 387 388 389 390 391 392

394 395 396 397 398 399 400 401 402 403 404 405 406

3.2.3. Performance evaluation on the independent test set 407 To assess the prediction performances of methods M1–M5 and other 408 ensemble learning methods on the independent dataset (see Materials 409 and methods section) and compare them with LocFuse, we re-trained 410 Q9 all methods using the SD dataset. Since there were only 2800 protein 411 pairs in the test set for which all the features required for method M4 412 are available, we excluded this method from the analysis. LocFuse 413 outperformed all its competitors, reaching an accuracy of 0.85 (Table 2). 414

P

360

best single classifier (F-measure and accuracy for the best performing ensemble learning method and BSC were around 0.71 for the RP dataset; for the SD dataset, the F-measure and accuracy were around 0.68 for the best performing ensemble learning and 0.64 for BSC), and so they cannot work well for PPI prediction. This observation shows the complex nature of the prediction problem and the necessity of using biological information for classifier fusion.

T

359

datasets (see Figs S3–S10 and Tables 3–10 of [50] for the prediction performances when considering all proteins in RP and SD datasets discarding the clustering).

D

357 358
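The feature-ranking step used before running AdaBoost and bagging (Section 3.2.1: rank features by information gain ratio, keep the 200 top-ranked) can be illustrated as follows. This is only a sketch under stated assumptions: the text does not say how numeric features were discretized, so equal-width binning into 10 bins is assumed here, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels, bins=10):
    """Information gain ratio of one numeric feature.

    Assumption: the feature is discretized by equal-width binning.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # constant feature falls into a single bin
    binned = [min(int((v - lo) / width), bins - 1) for v in values]
    groups = {}
    for b, y in zip(binned, labels):
        groups.setdefault(b, []).append(y)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(binned)
    return info_gain / split_info if split_info > 0 else 0.0


def top_k_features(columns, labels, k=200):
    """Indices of the k feature columns with the highest gain ratio."""
    scores = [gain_ratio(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)[:k]


# Toy data: feature 0 separates the classes, features 1 and 2 do not.
columns = [[0.1, 0.2, 0.8, 0.9], [0.5, 0.4, 0.5, 0.4], [1.0, 1.0, 1.0, 1.0]]
labels = [0, 0, 1, 1]
best = top_k_features(columns, labels, k=1)
```

In practice the choice of discretization affects the ranking, so the binning scheme should be treated as a tunable assumption rather than part of the method described in the text.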

J. Zahiri et al. / Genomics xxx (2014) xxx–xxx


3.2.4. LocFuse is not biased towards positive decisions

Because there are no negative examples in the independent test set, the good performance of the proposed method could be due to a bias of LocFuse towards positive decisions. We therefore ran LocFuse on 1 million protein pairs that have not been tested in the experiments reported in HPRD (such protein pairs were determined according to experiment IDs). Around 95% of the tested protein pairs were predicted as non-interacting by LocFuse (the results can be downloaded from http://lbb.ut.ac.ir/Download/LBBsoft/LocFuse/1-Million_Protein_Pairs.txt). Considering this result, we conclude that LocFuse is not biased towards positive decisions.
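The reasoning behind this check can be made concrete with a small sketch. On a test set that contains only interacting pairs, accuracy equals the fraction of pairs predicted as interacting, so a predictor that always answers "interacting" would score perfectly; the fraction of untested pairs predicted as non-interacting is what exposes such a bias. The function name and the toy predictions below are hypothetical and are not the study's data.

```python
def positive_rate(predictions):
    """Fraction of pairs predicted as interacting (label 1)."""
    return sum(predictions) / len(predictions)


# All-positive test set: accuracy is simply the positive rate (i.e. recall),
# so an always-positive predictor reaches 1.0 without any real skill.
always_positive = [1] * 10
accuracy_on_positives = positive_rate(always_positive)

# Hypothetical predictions for 20 untested pairs, one predicted as interacting.
untested_predictions = [1] + [0] * 19
predicted_negative_fraction = 1 - positive_rate(untested_predictions)
print(accuracy_on_positives, predicted_negative_fraction)
```

A high predicted-negative fraction on untested pairs is consistent with an unbiased predictor only under the assumption that most untested pairs do not interact.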


3.3. Future use of LocFuse


One advantage of the proposed ensemble learning method is its low computational cost: the base classifiers can be trained in parallel, and the combiner is a simple classifier (a K-NN classifier in a one-dimensional feature space) that estimates the local competence of the classifiers for classifying new instances. If the feature space for the combiner is properly determined, this ensemble learning method can be applied successfully to other bioinformatics problems or even to unrelated machine learning problems.
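One plausible reading of such a combiner is dynamic classifier selection: for a new instance, look up its nearest neighbors on the one-dimensional combiner feature among held-out instances, measure how often each base classifier was correct on those neighbors, and use the most competent one. The sketch below illustrates that idea only; it is not LocFuse's implementation, and the function name, the value of k, the combiner feature values and the correctness flags are all hypothetical.

```python
def select_classifier(x_new, val_x, val_correct, k=3):
    """Pick the base classifier with the highest local competence.

    x_new:       1-D combiner feature value of the new instance
    val_x:       1-D combiner feature values of held-out instances
    val_correct: dict mapping classifier name to a list of 0/1 flags
                 (1 = that classifier predicted the held-out instance correctly)
    """
    # Indices of the k held-out instances closest to x_new in the 1-D space.
    nearest = sorted(range(len(val_x)), key=lambda i: abs(val_x[i] - x_new))[:k]
    # Local competence = accuracy of each classifier on those neighbors.
    competence = {
        name: sum(flags[i] for i in nearest) / len(nearest)
        for name, flags in val_correct.items()
    }
    best = max(competence, key=competence.get)
    return best, competence


# Hypothetical held-out data: RF is reliable at low feature values, NB at high ones.
val_x = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
val_correct = {
    "RF": [1, 1, 1, 0, 0, 1],
    "NB": [0, 0, 1, 1, 1, 1],
}
best_low, _ = select_classifier(0.15, val_x, val_correct)
best_high, _ = select_classifier(0.85, val_x, val_correct)
print(best_low, best_high)
```

Because the neighbor search runs in a single dimension and the base classifiers are independent of one another, the selection step adds little cost on top of training the base classifiers.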

Table 2 Performance comparison between LocFuse, five popular ensemble learning methods (AdaBoost, bagging, weighted voting (WV), decision template (DT) and stacking) and five current state-of-the-art PPI prediction methods (M1: Martin et al. [3]; M2: Shen et al. [4]; M3: Guo et al. [5]; M4: Liu et al. [27] and M5: Zahiri et al. [30]) on the independent test set, which includes 19,090 interactions taken from the 1846 human protein complexes in the CORUM database. Since there were only 2800 protein pairs in the test set for which all the features required for method M4 are available, we excluded this method from the analysis. Because all protein pairs are interacting, the number of false positives is 0 for all methods, and accuracy is the most appropriate measure for evaluation.

Method    Accuracy
AdaBoost  0.76
Bagging   0.79
WV        0.74
DT        0.77
Stacking  0.75
M1        0.68
M2        0.57
M3        0.75
M4        –
M5        0.72
LocFuse   0.85

4. Conclusion

In this study, due to the sheer complexity of the PPI prediction problem, we have proposed a classifier fusion method, LocFuse, to predict human PPIs. Eight different feature types (post-translational modification, tissue information, codon usage and similarity based on GO annotations, as well as four PSSM-based features) were used to encode each protein pair. Four different base classifiers, namely random forest (RF), Naïve Bayes (NB), multi-layer perceptron (MLP) and radial basis function network (RBF), were trained in each feature space. The results demonstrated that if the proteome space is divided according to the cellular localization of proteins, the capability of some classifiers can be substantially improved. We therefore selected the most accurate classifier according to cellular localization in order to predict PPIs. The Human Protein Atlas (HPA) localization database was used to assign localizations to proteins; for proteins that lacked such information, we used the LocTree2 localization prediction tool. Based on the localization information, the proteins were categorized into five clusters (Loc-Clusts): non-membrane-non-secretory (NMNS), non-membrane-secretory (NMS), membrane-non-secretory (MNS), membrane-secretory (MS) and Unknown (UNK). The majority of the proteins, and their interactions, belong to the NMNS cluster. These results show that PPI databases, such as HPRD, favor non-membrane proteins and that these proteins can be better studied. To select confident positive interactions, we used those interactions in HPRD that were confirmed by at least two experimental methods; for negative dataset selection we used two strategies, yielding two benchmark datasets for evaluating and comparing different PPI prediction methods. The prediction performance of LocFuse was significantly better than that of the state-of-the-art PPI prediction methods on the two benchmark datasets. In addition, the low performance of popular ensemble learning methods indicates that biological knowledge is indispensable for classifier fusion. Based on these results, we can state that features differ in their predictive power for PPI prediction between Loc-Clusts; in general, however, our novel PSSM-based features contribute most to PPI prediction. In comparison with other methods, the results of LocFuse depend less on the benchmark dataset construction strategy, which can be interpreted as enhanced robustness of LocFuse. The results also demonstrated the improved performance of LocFuse in human protein complex detection compared with other PPI prediction methods.

Acknowledgment

We would like to thank Mohammad Javad Niroomand, Saman Hosseini Ashtiani and Omid Yaghoubi for their help, and Dr. Nowzari-Dalini and Ali Yousefi for partially providing computational facilities. We further thank Dr. Kavousi and Dr. Karimi-Jafari for taking time to personally answer our questions. Special thanks are expressed to Daniel Faria for kindly helping us with the GO semantic similarity computation; to Dr. Rajesh Raju for answering our questions about the HPRD database; and to Dr. Shinsheng Yuan and Chia Hsin Liu for answering our questions about their work. Also, we would like to thank all experimentalists and public database maintainers who provided various data used in our study.

Appendix A. Supplementary data

Supporting information available: Supplementary methods and results, containing detailed descriptions of the methods and results, are freely available online at http://lbb.ut.ac.ir/Download/LBBsoft/LocFuse.

Availability: LocFuse was developed with a user-friendly graphical interface and is freely available for non-commercial use on Linux, Mac OSX and Windows operating systems at http://lbb.ut.ac.ir/Download/LBBsoft/LocFuse. Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.ygeno.2014.10.006.

References

[1] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, Y. Sakaki, A comprehensive two-hybrid analysis to explore the yeast protein interactome, Proc. Natl. Acad. Sci. 98 (2001) 4569–4574.
[2] B. Alberts, The cell as a collection of protein machines: preparing the next generation of molecular biologists, Cell 92 (1998) 291–294.
[3] S. Martin, D. Roe, J. Faulon, Predicting protein–protein interactions using signature products, Bioinformatics 21 (2005) 218–226.
[4] J. Shen, J. Zhang, X. Luo, W. Zhu, K. Yu, K. Chen, et al., Predicting protein–protein interactions based only on sequences information, Proc. Natl. Acad. Sci. 104 (2007) 4337–4341.
[5] Y. Guo, L. Yu, Z. Wen, M. Li, Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences, Nucleic Acids Res. 36 (2008) 3025–3030.
[6] Z.-P. Liu, L. Chen, Proteome-wide prediction of protein–protein interactions from high-throughput data, Protein Cell 3 (2012) 508–520.
[7] Q.C. Zhang, D. Petrey, J.I. Garzón, L. Deng, B. Honig, PrePPI: a structure-informed database of protein–protein interactions, Nucleic Acids Res. 41 (2013) D828–D833.
[8] J. Zahiri, J. Hannon Bozorgmehr, A. Masoudi-Nejad, Computational prediction of protein–protein interaction networks: algorithms and resources, Curr. Genomics 14 (2013) 397–414.
[9] I. Gagnon-Arsenault, F.-C. Marois Blanchet, S. Rochette, G. Diss, A.K. Dubé, C.R. Landry, Transcriptional divergence plays a role in the rewiring of protein interaction networks after gene duplication, J. Proteome 81 (2013) 112–125.
[10] M. Kolář, M. Lässig, J. Berg, From protein interactions to functional annotation: graph alignment in Herpes, BMC Syst. Biol. 2 (2008) 90.
[11] S. Fields, O.-k. Song, A Novel Genetic System to Detect Protein–Protein Interactions, 1989.
[12] P. Uetz, S. Fumagalli, D. James, R. Zeller, Molecular interaction between limb deformity proteins (formins) and Src family kinases, J. Biol. Chem. 271 (1996) 33525–33530.
[13] P.J. Verschure, A.E. Visser, M.G. Rots, Step out of the groove: epigenetic gene control systems and engineered transcription factors, Adv. Genet. 56 (2006) 163–204.
[14] A. Elefsinioti, Ö.S. Saraç, A. Hegele, C. Plake, N.C. Hubner, I. Poser, et al., Large-scale de novo prediction of physical protein–protein association, Mol. Cell. Proteomics 10 (2011).
[15] J. Gillis, S. Ballouz, P. Pavlidis, Bias tradeoffs in the creation and analysis of protein–protein interaction networks, J. Proteome (2014).
[16] M. Abu-Farha, M. Werther, H. Seitz, Protein–Protein Interaction, Springer, 2008.
[17] P. Blohm, G. Frishman, P. Smialowski, F. Goebels, B. Wachinger, A. Ruepp, et al., Negatome 2.0: a database of non-interacting proteins derived by literature mining, manual annotation and protein structure analysis, Nucleic Acids Res. 42 (2014) D396–D400.
[18] L. Zhu, Z.-H. You, D.-S. Huang, Increasing the reliability of protein–protein interaction networks via non-convex semantic embedding, Neurocomputing 121 (2013) 99–107.
[19] Y.-K. Lei, Z.-H. You, T. Dong, Y.-X. Jiang, J.-A. Yang, Increasing reliability of protein interactome by fast manifold embedding, Pattern Recogn. Lett. 34 (2013) 372–379.
[20] S. Orchard, S. Kerrien, S. Abbani, B. Aranda, J. Bhate, S. Bidwell, et al., Protein interaction data curation: the International Molecular Exchange (IMEx) consortium, Nat. Methods 9 (2012) 345–350.
[21] G. Kuzu, A. Gursoy, R. Nussinov, O. Keskin, Exploiting conformational ensembles in modeling protein–protein interactions on the proteome scale, J. Proteome Res. 12 (2013) 2641–2653.
[22] Q.C. Zhang, D. Petrey, L. Deng, L. Qiang, Y. Shi, C.A. Thu, et al., Structure-based prediction of protein–protein interactions on a genome-wide scale, Nature 490 (2012) 556–560.
[23] J.O. Korbel, L.J. Jensen, C. Von Mering, P. Bork, Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs, Nat. Biotechnol. 22 (2004) 911–917.
[24] P.R. Kensche, V. van Noort, B.E. Dutilh, M.A. Huynen, Practical and theoretical advances in predicting the function of a protein by its phylogenetic distribution, J. R. Soc. Interface 5 (2008) 151–170.
[25] P.-Y. Chen, C.M. Deane, G. Reinert, Predicting and validating protein interactions using network structure, PLoS Comput. Biol. 4 (2008) e1000118.
[26] Y. Hao, X. Zhu, M. Huang, M. Li, Discovering patterns to extract protein–protein interactions from the literature: part II, Bioinformatics 21 (2005) 3294–3300.

[27] C.H. Liu, K.-C. Li, S. Yuan, Human protein–protein interaction prediction by a novel sequence-based co-evolution method: co-evolutionary divergence, Bioinformatics 29 (2013) 92–98.
[28] H.S. Najafabadi, R. Salavati, Sequence-based prediction of protein–protein interactions by means of codon usage, Genome Biol. 9 (2008) R87.
[29] X. Lin, X.-W. Chen, Heterogeneous data integration by tree-augmented naïve Bayes for protein–protein interactions prediction, Proteomics 13 (2013) 261–268.
[30] J. Zahiri, O. Yaghoubi, M. Mohammad-Noori, R. Ebrahimpour, A. Masoudi-Nejad, PPIevo: protein–protein interaction prediction from PSSM based evolutionary information, Genomics 102 (2013) 237–242.
[31] P. Minguez, L. Parca, F. Diella, D.R. Mende, R. Kumar, M. Helmer-Citterich, et al., Deciphering a global network of functionally associated post-translational modifications, Mol. Syst. Biol. 8 (2012).
[32] H. Kontaki, I. Talianidis, Cross-talk between post-translational modifications regulates life or death decisions by E2F1, Cell Cycle 9 (2010) 3836–3837.
[33] B.T. Seet, I. Dikic, M.-M. Zhou, T. Pawson, Reading protein modifications with interaction domains, Nat. Rev. Mol. Cell Biol. 7 (2006) 473–483.
[34] J. Woodsmith, A. Kamburov, U. Stelzl, Dual coordination of post translational modifications in human protein networks, PLoS Comput. Biol. 9 (2013) e1002933.
[35] A. Bossi, B. Lehner, Tissue specificity and the human protein interaction network, Mol. Syst. Biol. 5 (2009).
[36] M.H. Schaefer, T.J. Lopes, N. Mah, J.E. Shoemaker, Y. Matsuoka, J.-F. Fontaine, et al., Adding protein context to the human protein–protein interaction network to reveal meaningful interactions, PLoS Comput. Biol. 9 (2013) e1002860.
[37] O. Magger, Y.Y. Waldman, E. Ruppin, R. Sharan, Enhancing the prioritization of disease-causing genes through tissue specific protein interaction networks, PLoS Comput. Biol. 8 (2012) e1002690.
[38] M. Kiran, H.A. Nagarajaram, Global versus local hubs in human protein–protein interaction network, J. Proteome Res. 12 (2013) 5436–5446.
[39] H.S. Najafabadi, H. Goodarzi, R. Salavati, Universal function-specificity of codon usage, Nucleic Acids Res. 37 (2009) 7014–7023.
[40] R. Saunders, C.M. Deane, Synonymous codon usage influences the local protein structure observed, Nucleic Acids Res. 38 (2010) 6719–6728.
[41] T. Zhou, M. Weems, C.O. Wilke, Translationally optimal codons associate with structurally sensitive sites in proteins, Mol. Biol. Evol. 26 (2009) 1571–1580.
[42] M.A. Harris, J. Clark, A. Ireland, J. Lomax, M. Ashburner, R. Foulger, et al., The Gene Ontology (GO) database and informatics resource, Nucleic Acids Res. 32 (2004) D258–D261.
[43] S.R. Maetschke, M. Simonsen, M.J. Davis, M.A. Ragan, Gene Ontology-driven inference of protein–protein interactions using inducers, Bioinformatics 28 (2012) 69–75.
[44] X. Wu, L. Zhu, J. Guo, D.-Y. Zhang, K. Lin, Prediction of yeast protein–protein interaction network: insights from the Gene Ontology and annotations, Nucleic Acids Res. 34 (2006) 2137–2150.
[45] I.H. Witten, E. Frank, M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2011.
[46] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, 2004.
[47] M. Javadi, R. Ebrahimpour, A. Sajedin, S. Faridi, S. Zakernejad, Improving ECG classification accuracy using an ensemble of neural network modules, PLoS One 6 (2011) e24386.
[48] R. Ebrahimpour, E. Kabir, M.R. Yousefi, Improving mixture of experts for view-independent face recognition using teacher-directed learning, Mach. Vis. Appl. 22 (2011) 421–432.
[49] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, ACM SIGKDD Explorations Newsletter 11 (2009) 10–18.
[50] J. Zahiri, M. Mohammad-Noori, R. Ebrahimpour, T. Goldberg, A. Masoudi-Nejad, Novel Features to Encode Proteins for Protein–Protein Interaction Prediction, Data Brief (2014) (Submitted for publication).
[51] T.S. Keshava Prasad, R. Goel, K. Kandasamy, S. Keerthikumar, S. Kumar, S. Mathivanan, et al., Human Protein Reference Database—2009 update, Nucleic Acids Res. 37 (2009) D767–D772.
[52] S. Altschul, T. Madden, A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 25 (1997) 3389–3402.
[53] A. Acland, R. Agarwala, T. Barrett, J. Beck, D.A. Benson, C. Bollin, et al., Database resources of the National Center for Biotechnology Information, Nucleic Acids Res. 41 (2013) D8–D20.
[54] C. Pesquita, D. Faria, H. Bastos, A. Ferreira, A. Falcao, F. Couto, Metrics for GO based protein semantic similarity: a systematic evaluation, BMC Bioinformatics 9 (2008) S4.
[55] D. Faria, C. Pesquita, F.M. Couto, A. Falcão, Proteinon: A Web Tool for Protein Semantic Similarity, 2007.
[56] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, Wiley-IEEE Press, 2011.
[57] M. Uhlen, P. Oksvold, L. Fagerberg, E. Lundberg, K. Jonasson, M. Forsberg, et al., Towards a knowledge-based human protein atlas, Nat. Biotechnol. 28 (2010) 1248–1250.
[58] T. Goldberg, T. Hamp, B. Rost, LocTree2 predicts localization for all domains of life, Bioinformatics 28 (2012) i458–i465.
[59] J. Yu, M. Guo, C. Needham, Y. Huang, L. Cai, D. Westhead, Simple sequence-based kernels do not predict protein–protein interactions, Bioinformatics 26 (2010) 2610–2614.
[60] Y. Park, E.M. Marcotte, Flaws in evaluation schemes for pair-input computational predictions, Nat. Methods 9 (2012) 1134–1136.
[61] Y. Park, E.M. Marcotte, Revisiting the negative example sampling problem for predicting protein–protein interactions, Bioinformatics 27 (2011) 3024–3028.
[62] A. Ben-Hur, W.S. Noble, Choosing negative examples for the prediction of protein–protein interactions, BMC Bioinformatics 7 (Suppl. 1) (2006) S2.
[63] X.W. Chen, J.C. Jeong, P. Dermyer, KUPS: constructing datasets of interacting and non-interacting protein pairs with associated attributions, Nucleic Acids Res. 39 (2011) D750–D754.
[64] B.J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone, et al., The BioGRID Interaction Database: 2008 update, Nucleic Acids Res. 36 (2008) D637–D640.
[65] S. Kerrien, B. Aranda, L. Breuza, A. Bridge, F. Broackes-Carter, C. Chen, et al., The IntAct molecular interaction database in 2012, Nucleic Acids Res. 40 (2012) D841–D846.
[66] L. Licata, L. Briganti, D. Peluso, L. Perfetto, M. Iannuccelli, E. Galeota, et al., MINT, the molecular interaction database: 2012 update, Nucleic Acids Res. 40 (2012) D857–D861.
[67] I. Xenarios, E. Fernandez, L. Salwinski, X. Duan, M. Thompson, E. Marcotte, et al., DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions, Nucleic Acids Res. 30 (2002) 303–305.
[68] A. Ruepp, B. Brauner, I. Dunger-Kaltenbach, G. Frishman, C. Montrone, M. Stransky, et al., CORUM: the comprehensive resource of mammalian protein complexes, Nucleic Acids Res. 36 (2008) D646–D650.
[69] L.J. Jensen, R. Gupta, N. Blom, D. Devos, J. Tamames, C. Kesmir, et al., Prediction of human protein function from post-translational modifications and localization features, J. Mol. Biol. 319 (2002) 1257–1265.
[70] Y. Ofran, G. Yachdav, E. Mozes, T.-t. Soong, R. Nair, B. Rost, Create and assess protein networks through molecular characteristics of individual proteins, Bioinformatics 22 (2006) e402–e407.
[71] C.-C. Chang, C.-J. Lin, LIBSVM: A Library for Support Vector Machines, ACM Transactions on Intelligent Systems and Technology (TIST) 2 (2011) 27.


Please cite this article as: J. Zahiri, et al., LocFuse: Human protein–protein interaction prediction via classifier fusion using protein localization information, Genomics (2014), http://dx.doi.org/10.1016/j.ygeno.2014.10.006
