Feature selection and classification of protein-protein complexes based on their binding affinities using machine learning approaches.

Research Article

Proteins: Structure, Function and Bioinformatics DOI 10.1002/prot.24564

Short title: Classification of protein-protein complexes

K Yugandhar and M. Michael Gromiha*

Department of Biotechnology, Indian Institute of Technology Madras, Chennai 600036, Tamilnadu, India.

Key words: binding affinity, discrimination, feature selection, machine learning techniques, protein-protein interactions.

*Corresponding author. Tel: +91-2257-4138; Fax: +91-2257-4102; E-mail: [email protected]

This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as an ‘Accepted Article’, doi: 10.1002/prot.24564

© 2014 Wiley Periodicals, Inc.

Received: Jan 13, 2014; Revised: Mar 14, 2014; Accepted: Mar 14, 2014


Abstract: Protein-protein interactions are intrinsic to virtually every cellular process. Predicting the binding affinity of protein-protein complexes is one of the challenging problems in computational and molecular biology. In this work, we related sequence features of protein-protein complexes to their binding affinities using machine learning approaches. We set up a database of 185 protein-protein complexes for which the interacting pairs are heterodimers and experimental binding affinities are available. We developed a set of 610 features from the sequences of the protein complexes and selected specific features with the Ranker search method, which combines an attribute evaluator with the Ranker method. We analyzed several machine learning algorithms for discriminating protein-protein complexes into high and low affinity groups based on their Kd values. Our results showed a 10-fold cross-validation accuracy of 76.1% with a combination of nine features using support vector machines. Further, we observed an accuracy of 83.3% on an independent test set of 30 complexes. We suggest that our method would serve as an effective tool for identifying the interacting partners in protein-protein interaction networks and human-pathogen interactions based on the strength of interactions.



Introduction: Many biological functions involve the formation of protein-protein complexes1,2 and the interaction of two proteins is an important prerequisite in cell signaling pathways, regulation of metabolic pathways, immunologic recognition, DNA replication, progression through the cell cycle, and protein synthesis.3 Protein-protein interactions are also essential for bringing about any significant biological change as a complex. Understanding the recognition mechanism and binding specificity of protein-protein complexes is a challenging problem in molecular and computational biology. Protein-protein complexes can be classified into various types such as dimeric-multimeric, homodimer-heterodimer, obligate-non-obligate and transient-permanent based on different criteria such as the number and type of subunits involved, interaction time and biological significance of the complexes.1,4 Several studies have been carried out in recent times on different aspects of protein-protein interactions, which include the role of specific interactions5-7, understanding the recognition mechanism6, identifying the binding sites from protein structures8-13 and predicting the interaction sites from amino acid sequence.14-16 Further, prediction methods have been developed for identifying the interacting partners using protein structure17,18 and sequence information19-21. These methods are mainly based on structural similarity18, physico-chemical properties21, evolutionary information18 etc. Binding affinity of protein-protein complexes is one such parameter, which can be related to almost every functional aspect of the proteins.

Experimentally, interacting protein-protein pairs can be identified with the yeast two-hybrid system, Förster/fluorescence resonance energy transfer (FRET), surface plasmon resonance and isothermal titration calorimetry.22 The data on interacting pairs of proteins have been deposited in databases such as DIP23, BioGRID24 and STRING.25 Further, tools such as PIPE226 provide a platform for integration and annotation of interaction data from various databases. Complex experimental setups and time-consuming protocols underline the need for computational methods that could give reliable information about interacting partners or binding affinity. Along these lines, several computational methods have been developed to predict interacting protein partners, which are quite successful despite some challenges.27,28 For binding affinity, a few structure based methods have been proposed using empirical scoring functions,29-32 knowledge based methods33-36 and quantitative structure activity relationship methods.37 These methods were mainly based on structural information, and it is necessary to develop sequence based methods for annotating protein-protein interaction networks and identifying interacting partners at large scale.

In this work, we have systematically analyzed the sequences of interacting proteins and derived a set of 610 features. Using feature selection procedures, we have selected a set of nine features (attributes) and developed a model for discriminating protein-protein complexes based on their affinities. The selected features include predicted binding site residues and propensities for α-helices and β-sheets, which are reported to be important for the binding affinity of protein-protein complexes.57-60 We then systematically analyzed the contribution of the selected features to discriminating protein-protein complexes based on their binding affinities. Our method using support vector machines could discriminate 155 protein-protein complexes of high and low affinities with a 10-fold cross-validation accuracy of 76.1%. Further, our method was tested with a set of 30 complexes, which showed an accuracy of 83.3%. We suggest that our method could be effectively used for identifying interacting partners with low and high affinities in protein-protein interaction networks and host-pathogen interactions.



Materials and Methods

Dataset: We have compiled a dataset of 185 protein-protein complexes for the present study with the following conditions: (i) the experimental binding affinity (Kd value) is known,33,38-40 (ii) both binding partners of a complex have more than 50 amino acids each and (iii) the complexes are heterodimers. The dataset includes protein-protein complexes with diverse functions (antigen-antibody, enzyme-inhibitor, G-protein containing, receptor containing etc.), various ranges of molecular weights and disordered regions. These 185 complexes have been classified into two groups based on their binding affinities. Complexes with Kd less than 10^-8 M were considered as the high affinity class and complexes with Kd greater than or equal to 10^-8 M were considered as the low affinity class. The Kd range for the high affinity class is the one generally considered for permanent protein-protein complexes,41 which emphasizes the biological importance of our model. With this criterion, we have obtained a balanced dataset in which 98 and 87 protein-protein complexes have been assigned to the high and low affinity classes, respectively. The Protein Data Bank (PDB) codes42 for these two sets of complexes are given in Table I and the description of all the 185 complexes is given in supplementary Table S1.
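The two-class assignment described above can be illustrated with a minimal Python sketch. The cutoff of 10^-8 M and the direction of the inequality come from the text; the function name, the complex names and the Kd values are hypothetical and serve only as an illustration.

```python
# Cutoff from the text: Kd < 1e-8 M is the high affinity class,
# Kd >= 1e-8 M is the low affinity class.
KD_CUTOFF_M = 1e-8


def affinity_class(kd_molar: float) -> str:
    """Return 'high' or 'low' for a dissociation constant given in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be a positive concentration in mol/L")
    return "high" if kd_molar < KD_CUTOFF_M else "low"


# Hypothetical Kd values (not taken from the dataset of the paper).
examples = {"complex_A": 2.5e-10, "complex_B": 1e-8, "complex_C": 4.0e-6}
classes = {name: affinity_class(kd) for name, kd in examples.items()}
print(classes)
```

A complex whose Kd equals the cutoff exactly falls into the low affinity class, in line with the "greater than or equal to" wording of the criterion.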

Features

We have utilized a set of 610 sequence based features in this study. The features include a diverse set of 544 features that account for various physico-chemical, conformational, energetic and biochemical properties of amino acids obtained from the AAindex database43 as well as 49 properties from the literature.44 In addition, we have used 17 features from the information on predicted binding site residues, predicted aromatic and charged residues at the interface15 and predicted solvent accessibility.45 All these features have been computed for all 185 protein-protein complexes from their amino acid sequences. Further, we have reduced the number of features as discussed below.

Machine learning methods: The WEKA data mining software46 was used for machine learning tasks comprising feature selection and classification. We have analyzed various machine learning techniques implemented in the WEKA platform for discriminating protein-protein complexes based on their binding affinity. WEKA includes several methods based on neural networks, regression analysis, Bayes functions, logistic functions, nearest neighbor methods, meta learning, decision trees and rules. Based on the performance of all the techniques on different feature sets using the Experimenter module in WEKA, we selected the SMO (Sequential Minimal Optimization) algorithm,47 which is an SVM based method, for the classification of complexes in our dataset. The SVM is a learning machine for two-group classification problems that transforms the attribute space into a multidimensional feature space using a kernel function to separate dataset instances by an optimal hyperplane.48 We have used feature selection methods available in WEKA46 and a program available at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm.49 WEKA provides various attribute and subset evaluator methods such as CFS,50 Chi-squared, Classifier, SVM,51 Infogain, ReliefF,52 Gain ratio, Consistency53 and so on as well as various search methods including Best-first, Genetic search54 and Ranker. A brief description of each of the above mentioned methods available in WEKA is given below.

CFS: Considers the individual predictive ability of each feature along with the degree of redundancy between them.
Chi-squared: Computes the value of the chi-squared statistic with respect to the class.
Classifier: Evaluates attribute subsets on training data or a separate hold out testing set. It uses a classifier to estimate the ‘merit’ of a set of attributes.
SVM: Evaluates based on SVM-RFE i.e. “Recursive feature elimination”.
Infogain: Measures the information gain with respect to the class.
ReliefF: Evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instance of the same and different class.
Gain ratio: Measures the gain ratio with respect to the class.
Consistency: Evaluates the worth of a subset of attributes by the level of consistency in the class values when the training instances are projected onto the subset of attributes.
Best-first: Searches by greedy hill-climbing augmented with a back-tracking facility.
Genetic: Performs a search using the simple genetic algorithm.
Ranker: Ranks the attributes by their individual evaluations.
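To make the evaluator-plus-Ranker idea concrete, the following Python sketch scores each attribute on its own by information gain and then sorts the attributes by that score. It is an illustration only, not WEKA code: it assumes discrete attribute values, uses made-up toy data and hypothetical attribute names, and uses information gain simply because it is compact to write down (the final model in this work relies on the SVM attribute evaluator instead).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Class entropy minus the expected class entropy after splitting
    the instances on a discrete attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(table, labels):
    """Ranker-style ordering: evaluate each attribute individually,
    then sort with the best-scoring attribute first."""
    scores = {name: information_gain(column, labels)
              for name, column in table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration.
labels = ["high", "high", "low", "low"]
table = {
    "feat_perfect": ["a", "a", "b", "b"],  # separates the classes exactly
    "feat_useless": ["a", "b", "a", "b"],  # carries no class information
    "feat_partial": ["a", "a", "a", "b"],  # separates them partly
}
ranking = rank_attributes(table, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because each attribute is scored in isolation, a ranker of this kind does not account for redundancy between attributes, which is one reason a separate redundancy-reduction step is useful.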




Feature set reduction and selection

The feature set was reduced in order to remove redundancy by employing the "Dimensionality reduction by correlation" criterion. The refined feature set contains 216 amino acid properties, chosen such that the absolute r-value between any two of them is less than 0.85. We have added 17 more features from the information on predicted interface residues and solvent accessibility,15,45 which resulted in a total of 233 features. Further, the best features contributing to the discrimination were selected by employing the Ranker search method (a combination of the SVM attribute evaluator and Ranker methods) in WEKA.
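The correlation-based reduction can be sketched in Python as a greedy filter that keeps a property only if its absolute Pearson correlation with every property already kept is below 0.85. The cutoff is taken from the text; the greedy order of consideration, the treatment of constant columns, the function names and the toy values are assumptions made for this illustration, since the exact procedure is not specified.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, cutoff=0.85):
    """Greedy filter: keep a property only if |r| < cutoff against
    every property that has already been kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson_r(values, columns[k])) < cutoff for k in kept):
            kept.append(name)
    return kept


# Toy property values, invented for illustration.
columns = {
    "prop_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "prop_2": [2.1, 3.9, 6.2, 8.0, 9.9],  # roughly 2 * prop_1, so redundant
    "prop_3": [5.0, 1.0, 4.0, 2.0, 3.0],  # r = -0.3 against prop_1
}
kept = drop_correlated(columns)
print(kept)
```

With a greedy filter the surviving set depends on the order in which properties are visited, so a different ordering could retain a different but equally non-redundant subset.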

Assessment of discrimination performance and validation procedures

We have used the n-fold cross-validation procedure for evaluating the performance of the method. In this procedure, the dataset is divided into n subsets; n-1 subsets are used to develop a model and the remaining subset is used to test it, and this is repeated so that each subset is used once for testing. The prediction performance has been assessed with the following measures:

Accuracy = (TP+TN)/(TP+TN+FP+FN)    (1)
Sensitivity (or) Recall = TP/(TP+FN)    (2)
Specificity = TN/(TN+FP)    (3)
Precision = TP/(TP+FP)    (4)
F-measure = 2 x ((Precision x Recall)/(Precision + Recall))    (5)

In these equations, TP, TN, FP and FN represent true positives, true negatives, false positives and false negatives, respectively. In addition, the AUC (area under the ROC curve, which relates the true positive rate to the false positive rate) has been estimated.
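As a worked check of Eqs. (1)-(5), the short Python sketch below computes the measures from the four counts. The example uses the counts reported later in this article for the blind set of seven TNF superfamily complexes (all three high affinity complexes and three of the four low affinity complexes classified correctly). Treating the high affinity class as the positive class is an assumption of this sketch, and the function name is hypothetical.

```python
def performance(tp, tn, fp, fn):
    """Performance measures of Eqs. (1)-(5) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_measure = 2 * (precision * sensitivity) / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }


# TNF superfamily blind set, with high affinity taken as the positive class
# (an assumption): 3 true positives, 0 false negatives, 3 true negatives,
# 1 false positive.
tnf = performance(tp=3, tn=3, fp=1, fn=0)
for name, value in tnf.items():
    print(f"{name}: {100 * value:.1f}%")
```

The accuracy evaluates to 6/7, which matches the 85.7% quoted for this set. The sketch does not guard against empty classes, where some denominators would be zero.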




Results and discussion

Selected features for discrimination

We have tried various combinations of evaluator and search methods available in WEKA to delineate the best features for discriminating protein-protein complexes with high and low affinities. We observed that the Ranker search method showed the best performance with the SVM attribute evaluator, which analyzes all the considered features and arranges them in order of priority for discrimination. Using all the features gave an average accuracy of about 70%, but it also risks over-fitting, as the number of features is very high compared with the number of complexes used in the present study. Hence, we removed the features one by one and evaluated the performance in terms of accuracy and ROC. We noticed a marginal increase in prediction accuracy with the elimination of different features. Finally, we identified a set of 9 features, which showed the maximum accuracy with 10-fold and 3-fold cross-validation tests. The selected properties are weights for α-helix at the window position of -6,55 β-sheet at the window positions of -6, -3 and 5,55 principal property value z2 showing side chain bulkiness,56 number of predicted binding site residues in receptors,18 number of predicted binding site aromatic and positively charged residues in receptors and ligands18 and percentage of binding site aromatic and positively charged residues in ligands.18 Of the final list of 9 features, 4 (about 44%) are derived from predicted binding site residues. Interestingly, the information on aromatic and positively charged residues at the binding sites is identified as one of the most important features for discriminating the protein-protein complexes based on their affinities. This observation agrees well with previous results reported in the literature57,58 and emphasizes the importance of binding site residues, especially aromatic and positively charged residues, in governing the binding affinity between two interacting proteins. In addition, secondary structure based properties play an important role in discrimination along with a physical property, side chain bulkiness. It has been shown that induction of α-helical structure in the calmodulin-binding sequence of a target protein is an important step in the activation of target enzymes, which in turn could be a determining factor for binding affinity.59,60 The selected feature set for our model includes a measure of helix propensity. These results emphasize the importance of α-helices in the formation of protein-protein complexes and in governing the binding affinity.
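The one-by-one removal of ranked features described in this section can be sketched as a simple backward elimination loop. This is a hedged illustration rather than the exact protocol used here: it assumes that the lowest-ranked feature is dropped in each round and that the smallest subset with the best score is retained, and the scoring function below is a toy stand-in for the cross-validated accuracy that would come from the classifier. All names are hypothetical.

```python
def backward_elimination(ranked_features, evaluate):
    """Drop the lowest-ranked feature in each round and return the
    best-scoring subset seen, preferring smaller subsets on ties.

    ranked_features: feature names ordered best first.
    evaluate: callable mapping a list of features to a score.
    """
    current = list(ranked_features)
    best_subset, best_score = list(current), evaluate(current)
    while len(current) > 1:
        current = current[:-1]  # remove the lowest-ranked feature
        score = evaluate(current)
        if score >= best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Toy scorer (hypothetical): rewards informative features and
# penalizes the size of the subset.
INFORMATIVE = {"f1", "f2", "f3"}


def toy_score(subset):
    hits = len(INFORMATIVE.intersection(subset))
    return hits - 0.1 * len(subset)


ranked = ["f1", "f2", "f3", "f4", "f5", "f6"]
subset, score = backward_elimination(ranked, toy_score)
print(subset, round(score, 2))
```

In practice the `evaluate` callable would train and cross-validate the classifier on the given subset, which is the expensive step and the one whose variance determines how trustworthy small differences between subsets are.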

Analysis of selected features for discrimination

We have classified the protein-protein complexes into two groups based on their affinities and analyzed the distribution of all nine selected features. A few specific examples are discussed below. Figure 1(A) shows the distribution of protein-protein complexes with high and low affinities based on the number of predicted aromatic and positively charged residues at the interface of ligands. We noticed that high affinity protein-protein complexes have more positively charged and aromatic residues at the interface than low affinity complexes. With the cutoff of 9 residues, the percentage of high and low affinity complexes is 23% and 6%, respectively. Interestingly, this parameter, which is selected as one of the features for discriminating high and low affinity complexes, is also reported to be an important factor for understanding the binding specificity of protein-protein complexes.57,58 Figure 1(B) shows the weights for β-sheet at the window position of -6, which are used in predicting protein secondary structures.55 We noticed that more high affinity complexes than low affinity complexes have weights of less than zero. We observed a similar trend for the weights for α-helix. These results reveal the importance of secondary structures for the specificity of protein-protein complexes, in agreement with experimental reports.60 Other selected properties also showed marked differences between low and high affinity complexes (data not shown).

Discrimination of protein-protein complexes based on their affinities

We have utilized different algorithms available in WEKA to discriminate the protein-protein complexes based on their affinities, and the SMO method (which uses support vector machines) showed the best performance based on sensitivity, specificity, accuracy and ROC. Further, we have varied the adjustable parameters in SVM, and the model with a polynomial kernel and C value of 1.0 yielded the highest accuracy. The discrimination performance using different datasets is presented in Table II. Our method could discriminate low and high affinity protein-protein complexes with an average accuracy of 76.1% using 10-fold cross-validation on a set of 155 complexes. The sensitivity and specificity are 75.6% and 76.7%, respectively. We have applied the same model to a test set of 30 complexes and the discrimination accuracy is 83.3%. Further, we have tested for over-fitting by evaluating the model with self-consistency, and the results are very similar to those of the cross-validation experiments. This observation indicates that our model is not over-fitted. It is noteworthy that the model was developed with a limited set of 185 complexes and it can be refined as more data on protein-protein binding affinity become available.

Influence of sequence redundancy for discriminating high and low affinity complexes

Protein-protein binding affinity is influenced by several experimental factors such as protein concentration, pH, temperature etc. In addition, mutation of a single amino acid residue could drastically change the affinity of the complexes.61,62 Hence, we have not applied a redundancy criterion and used all 185 complexes in the present work. The performance of our model on a blind dataset indicates that it is robust and not over-fitted. For further evaluation, we have developed non-redundant datasets using the cutoff of less than 25% sequence identity in (i) receptor, (ii) ligand and (iii) receptor or ligand.63 This yielded sets of 92, 125 and 144 protein-protein complexes based on the non-redundancy in receptor, ligand and either of them, respectively. Our method could discriminate the high and low affinity complexes in these three datasets with accuracies of 64%, 77% and 75%, respectively.

Analyzing performance of the model on a particular family of proteins apart from the test set

Apart from the test set of 30 complexes, we have examined the predictive power of our model on an additional blind set of seven complexes, which belong to a common group called “Tumor necrosis factor (TNF) superfamily”.64 Among the seven complexes, three have high affinity and four have low affinity. Our method correctly classified all three high affinity complexes and three of the four low affinity complexes, which corresponds to an accuracy of 85.7%.



Performance of the method on disordered proteins

Dosztanyi et al.65 reported that hub proteins contain a large proportion of disordered regions and tend to have long sequences. We have analyzed the influence of disordered proteins (or regions) on the affinities of hetero-dimeric complexes. Among the 185 complexes used in the present study, structures of free proteins are available for 138 complexes and 90 of them have at least one protein in the disordered state. The analysis of these 90 complexes based on their binding affinities showed that 60% of them (54 complexes) are of low affinity. Our model could correctly classify 77% of all the disordered complexes using 10-fold cross-validation, which is similar to the performance on the whole dataset of 185 complexes. It has been reported that ordered and intrinsically unstructured complexes mainly differ in their interface properties.66 Interestingly, our method selected four features (from the list of 233 features) that are related to interface properties (derived from predicted binding site residues). Hence, these properties might play a key role in differentiating ordered and disordered complexes as well as high and low affinity complexes. Further, we have divided the set of 138 complexes into two groups (disordered with 90 complexes and ordered with 48 complexes) and performed feature selection separately with the aim of achieving the highest accuracy. We found that most of the features selected in the two sets are similar, except for a few features such as “charge of the protein”, which is selected only for ordered proteins. This supports the previous observation suggesting the importance of electrostatic and cation-π interactions for the recognition of protein-protein complexes.7 In addition, we noticed that 7 of the 11 features (including interface properties) selected for the disordered set are also present in the final list derived for 155 complexes. This reiterates the importance of the reduced set of features developed in this work.




Analysis of large-scale protein-protein interacting pairs based on high and low affinity

We have employed our method for analyzing protein-protein interaction data available in major databases. We have collected a set of 4712 protein-protein interactions in yeast that are deposited in the DIP database.23 Our analysis showed that 43% and 57% of the interacting pairs are predicted to be of high and low affinity, respectively. The predicted high affinity complexes will be helpful for selecting targets in structure based drug design. Further analysis of protein-protein interactions in various organisms, host-pathogen interactions and validations is in progress.

Rational comparison of different attribute selection methods

We have employed different combinations of attribute evaluators and search methods available in WEKA for the feature selection process and the results are presented in Table III. From this table, we observed that the combination of the SVM attribute evaluator and the Ranker search method has the best performance using a minimum number of nine features. Other combinations of attribute evaluators and search methods showed either a lower AUC or used more features than the SVM attribute evaluator with the Ranker search method. In addition, we have used the random forest method available at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm for selecting the features,49 which showed an AUC of 0.74 (and an accuracy of 73.5%) using 24 features. Hence, we have selected the combination of the SVM attribute evaluator and the Ranker search method for selecting the features to discriminate high and low affinity protein-protein complexes.




Comparison of different classifiers

We have compared the performance of different classifiers for discriminating high and low affinity protein-protein complexes and the results for 7 typical methods are presented in Table IV. Most of the methods discriminated the low and high affinity protein-protein complexes with an accuracy in the range of 62% to 72%. The present method based on support vector machines could discriminate them with an accuracy of 76.1%, with balanced sensitivity (75.6%) and specificity (76.7%).

Influence of specific features for discrimination

We have evaluated the importance of each selected feature for discrimination by removing it from the list and analyzing the accuracy using 10-fold cross-validation on the dataset of 155 complexes. The results are presented in Table V. We noticed that the accuracy decreases by 1 to 6% on removing a single feature. Specifically, removing the percentage of aromatic and positively charged residues in predicted binding sites of ligands decreased the accuracy from 76.1% to 71.6%, showing its importance in discrimination.

Comparison with other methods

The present work is the first sequence based method for classifying protein-protein complexes based on their binding affinity. This method is different from the structure based methods proposed in the literature, which are mainly for predicting the absolute binding affinity of protein-protein complexes. These methods have several limitations: (i) they are applicable only to a training set of complexes,35 (ii) they utilize a large number of descriptors,37 (iii) they show high correlation only for rigid complexes35 and (iv) they require structural information.28 On the other hand, the present method has several advantages: (i) the features are derived from amino acid sequences, (ii) it utilizes a limited number of features, (iii) it classifies protein-protein complexes into low and high affinity classes and (iv) it shows a good performance. Although a direct comparison of our method with other existing methods is not appropriate, the analysis shows that the present method has several advantages over other methods reported in the literature.

Conclusion

The analysis of a large number of amino acid features that influence the binding affinity of protein-protein complexes showed that conformational properties, α-helical and β-strand tendencies, bulkiness and the number of predicted aromatic and charged residues at the protein-protein interface are important for discriminating protein-protein complexes of high and low affinities. Interestingly, the dominance of aromatic and charged residues at the interface is important for recognition due to the formation of electrostatic, aromatic-aromatic and cation-π interactions, which are reported to play vital roles in the formation of protein-protein complexes. In addition, the features related to protein secondary structures, which are identified by our feature selection methods, are also shown to play an important role in recognition. The combination of these features could successfully discriminate high and low affinity protein-protein complexes with an accuracy in the range of 76-85% using different sets of data and validation procedures. Hence, the present method could be used to identify the interacting partners in protein-protein interaction networks and human-pathogen interactions based on their affinities. Further, the work on predicting the binding affinity from amino acid sequence is in progress.




Acknowledgements

We thank the Associate Editor and reviewers for constructive comments. KY thanks the University Grants Commission (UGC), Government of India for providing a research fellowship. We thank the Bioinformatics facility and Indian Institute of Technology Madras for computational facilities. The work was partially supported by the Department of Science and Technology, Government of India to MMG (SR/SO/BB-0036/2011).

Supportive/Supplementary Material

Table S1: Description for all the complexes used in the study.

References:
1. Jones S, Thornton JM. Principles of protein-protein interactions. Proc Natl Acad Sci USA 1996;93:13-20.
2. Nooren IM, Thornton JM. Diversity of protein-protein interactions. EMBO J 2003;22:3486-3492.
3. Alberts B, Bray D, Lewis J, Raff M, Roberts K, Watson JD. Molecular Biology of the Cell, 2nd edn. New York: Garland; 1989.
4. Keskin O, Gursoy A, Ma B, Nussinov R. Principles of protein-protein interactions: what are the preferred ways for proteins to interact? Chem Rev 2008;108:1225-1244.
5. Bahadur RP, Chakrabarti P, Rodier F, Janin J. A dissection of specific and non-specific protein-protein interfaces. J Mol Biol 2004;336:943-955.
6. Gromiha MM. Protein Bioinformatics: From Sequence to Function. Elsevier; 2010.
7. Gromiha MM, Yokota K, Fukui K. Energy based approach for understanding the recognition mechanism in protein-protein complexes. Mol Biosyst 2009;5:1779-1786.
8. Jones S, Thornton JM. Prediction of protein-protein interaction sites using patch analysis. J Mol Biol 1997;272:133-43.
9. Neuvirth H, Raz R, Schreiber G. ProMate: a structure based prediction program to identify the location of protein–protein binding sites. J Mol Biol 2004;338:181-199.
10. Fernandez-Recio J, Totrov M, Abagyan R. Identification of protein-protein interaction sites from docking energy landscapes. J Mol Biol 2004;335:843-865.
11. Fernandez-Recio J, Totrov M, Skorodumov C, Abagyan R. Optimal docking area: a new method for predicting protein–protein interaction sites. Proteins 2005;58:134-143.
12. La D, Kihara D. A novel method for protein-protein interaction site prediction using phylogenetic substitution models. Proteins 2012;80:126-141.
13. La D, Kong M, Hoffman W, Choi YI, Kihara D. Predicting permanent and transient protein-protein interfaces. Proteins 2013;81(5):805-818.

14. Ofran Y, Rost B. Predict protein-protein interaction sites from local sequence information. FEBS Lett 2003;544:236-239.
15. Ofran Y, Rost B. ISIS: interaction sites identified from sequence. Bioinformatics 2007;23:e13-6.
16. Ahmad S, Mizuguchi K. Partner-Aware Prediction of Interacting Residues in Protein-Protein Complexes from Sequence Data. PLoS ONE 2011;6(12):e29104.
17. Shoemaker BA, Panchenko AR. Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comput Biol 2007;3:595-601.
18. Tuncbag N, Gursoy A, Keskin O. Prediction of protein-protein interactions: unifying evolution and structure at protein interfaces. Phys Biol 2011;8:035006.
19. Martin S, Roe D, Faulon JL. Predicting protein–protein interactions using signature products. Bioinformatics 2005;21(2):218-226.
20. Pan XY, Zhang YN, Shen HB. Large-Scale Prediction of Human Protein-Protein Interactions from Amino Acid Sequence Based on Latent Topic Features. J Proteome Res 2010;9(10):4992-5001.
21. Zhang YN, Pan XY, Huang Y, Shen HB. Adaptive compressive learning for prediction of protein-protein interactions from primary sequence. J Theor Biol 2011;283(1):44-52.
22. Phizicky EM, Fields S. Protein-protein interactions: methods for detection and analysis. Microbiol Rev 1995;59:94-123.
23. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res 2004;32(suppl 1):D449-D451.
24. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006;34(suppl 1):D535-D539.
25. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, von Mering C. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res 2011;39(suppl 1):D561-D568.
26. Ramos H, Shannon P, Brusniak MY, Kusebauch U, Moritz RL, Aebersold R. The Protein Information and Property Explorer 2: Gaggle-like exploration of biological proteomic data within one webpage. Proteomics 2011;11(1):154-158.
27. Wass MN, David A, Sternberg MJ. Challenges for the prediction of macromolecular interactions. Curr Opin Struct Biol 2011;21:382-390.
28. Kastritis PL, Bonvin AMJJ. On the binding affinity of macromolecular interactions: daring to ask why proteins interact. J R Soc Interface 2013;10:20120835.
29. Horton N, Lewis M. Calculation of the free energy of association for protein complexes. Protein Sci 1992;1:169-181.
30. Ma XH, Wang CX, Li CH, Chen WZ. A fast empirical approach to binding free energy calculations based on protein interface information. Protein Eng 2002;15:677-681.
31. Audie J, Scarlata S. A novel empirical free energy function that explains and predicts protein–protein binding affinities. Biophys Chem 2007;129:198-211.
32. Jiang L, Gao Y, Mao F, Liu Z, Lai L. Potential of mean force for protein-protein interaction studies. Proteins 2002;46:190-196.
33. Zhang C, Liu S, Zhu Q, Zhou Y. A knowledge-based energy function for protein-ligand, protein-protein, and protein-DNA complexes. J Med Chem 2005;48:2325-2335.




34. Su Y, Zhou A, Xia X, Li W, Sun Z. Quantitative prediction of protein-protein binding affinity with a potential of mean force considering volume correction. Protein Sci 2009;18:2550-2558.
35. Moal IH, Agius R, Bates PA. Protein-protein binding affinity prediction on a diverse set of structures. Bioinformatics 2011;27:3002-3009.
36. Vreven T, Hwang H, Pierce BG, Weng Z. Prediction of protein-protein binding free energies. Protein Sci 2012;21:396-404.
37. Tian F, Lv Y, Yang L. Structure-based prediction of protein-protein binding affinity with consideration of allosteric effect. Amino Acids 2012;43:531-543.
38. Kastritis PL, Bonvin AM. Are scoring functions in protein-protein docking ready to predict interactomes? Clues from a novel binding affinity benchmark. J Proteome Res 2010;9:2216-2225.
39. Kastritis PL, Moal IH, Hwang H, Weng Z, Bates PA, Bonvin AM, Janin J. A structure-based benchmark for protein-protein binding affinity. Protein Sci 2011;20:482-491.
40. Nooren IM, Thornton JM. Structural characterisation and functional significance of transient protein–protein interactions. J Mol Biol 2003;325:991-1018.
41. Perkins JR, Diboun I, Dessailly BH, Lees JG, Orengo C. Transient protein-protein interactions: structural, functional, and network properties. Structure 2010;18:1233-1243.
42. Rose PW, Bi C, Bluhm WF, Christie CH, Dimitropoulos D, Dutta S, Green RK, Goodsell DS, Prlic A, Quesada M, Quinn GB, Ramos AG, Westbrook JD, Young J, Zardecki C, Berman HM, Bourne PE. The RCSB Protein Data Bank: new resources for research and education. Nucleic Acids Res 2013;41:D475-D482.
43. Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 2008;36:D202-D205.
44. Gromiha MM. A statistical model for predicting protein folding rates from amino acid sequence with structural class information. J Chem Inf Model 2005;45:494-501.
45. Garg A, Kaur H, Raghava GP. Real value prediction of solvent accessibility in proteins using multiple sequence alignment and secondary structure information. Proteins 2005;61:318-324.
46. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA Data Mining Software: An Update. SIGKDD Explorations 2009;11(1):10-18.
47. Platt JC. Fast Training of Support Vector Machines using Sequential Minimal Optimization. Microsoft Research 2000;12:41-65.
48. Hearst MA. Support Vector Machines. IEEE Intelligent Systems 1998;18-28.
49. Breiman L. Random forests. 2001; Available at http://oz.berkeley.edu/users/breiman/randomforest2001.pdf.
50. Hall MA. Correlation-Based Feature Selection for Machine Learning, PhD thesis, Dept. of Computer Science, Univ of Waikato, Hamilton, New Zealand, 1998.
51. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning 2002;46:389-422.
52. Sikonja M, Kononenko I. An Adaptation of Relief for Attribute Estimation in Regression, Proceedings of 14th International Conference on Machine Learning (ICML '97), Nashville, TN, USA, July 8-12, 1997; pp. 296-304.




53. Liu H, Setiono R. A probabilistic approach to feature selection - A filter solution, 13th International Conference on Machine Learning, Bari, Italy, July 3-6, 1996;319-327.
54. Goldberg DE. Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, 1989.
55. Qian N, Sejnowski T. Predicting the secondary structure of globular proteins using Neural Network models. J Mol Biol 1988;202:865-884.
56. Wold S, Eriksson L, Hellberg S, Jonsson J, Sjöström M, Skagerberg B, Wikström C. Principal property values for six non-natural amino acids and their application to a structure-activity relationship for oxytocin peptide analogues. Can J Chem 1987;65:1814-1820.
57. Gromiha MM, Selvaraj S, Jayaram B, Fukui K. Identification and analysis of binding site residues in protein complexes: energy based approach. Lecture Notes in Comp Sci 2010;6215:626-633.
58. Gromiha MM, Saranya N, Selvaraj S, Jayaram B, Fukui K. Sequence and structural features of binding site residues in protein-protein complexes: comparison with protein-nucleic acid complexes. Proteome Sci 2011;9:S13.
59. Yuan T, Walsh MP, Sutherland C, Fabian H, Vogel HJ. Calcium-dependent and independent interactions of the calmodulin-binding domain of cyclic nucleotide phosphodiesterase with calmodulin. Biochemistry 1999;38:1446-1455.
60. Brokx RD, Lopez MM, Vogel HJ, Makhatadze GI. Energetics of target peptide binding by calmodulin reveals different modes of binding. J Biol Chem 2001;276:14083-14091.
61. Thorn KS, Bogan AA. ASEdb: a database of alanine mutations and their effects on the free energy of binding in protein interactions. Bioinformatics 2001;17:284-285.
62. Kumar MDS, Gromiha MM. PINT: Protein-protein Interactions Thermodynamic Database. Nucl Acids Res 2006;34:D195-D198.
63. Wang G, Dunbrack Jr. RL. PISCES: a protein sequence culling server. Bioinformatics 2003;19(12):1589-1591.
64. Day ES, Cote SM, Whitty A. Binding efficiency of protein-protein complexes. Biochemistry 2012;51:9124-9136.
65. Dosztanyi Z, Chen J, Dunker AK, Simon I, Tompa P. Disorder and sequence repeats in hub proteins and their implications for network evolution. J Proteome Res 2006;5(11):2985-2995.
66. Mészáros B, Tompa P, Simon I, Dosztányi Z. Molecular principles of the interactions of disordered proteins. J Mol Biol 2007;372(2):549-561.




Figure Legends

Figure 1: Analysis of selected features for their discriminative ability
Figure 1(A): Number of aromatic and positively charged residues in predicted binding sites of ligands
Figure 1(B): Weights for β-sheet at the window position of -6




Table I List of PDB codes for the protein-protein complexes with high and low affinities

Class: High affinity (98 complexes)
PDB IDs with chains of the two interacting proteins: 1ACB_E:I, 1AHW_AB:C, 1ATN_A:D, 1AVX_A:B, 1AY7_A:B, 1BJ1_HL:VW, 1BRS_A:D, 1BVN_P:T, 1DFJ_E:I, 1DQJ_AB:C, 1EAW_A:B, 1EER_A:BC, 1EMV_B:A, 1EZU_C:AB, 1F34_A:B, 1FLE_E:I, 1FSK_BC:A, 1GPW_A:B, 1GXD_A:C, 1HCF_AB:X, 1I2M_A:B, 1IBR_A:B, 1IQD_AB:C, 1JIW_P:I, 1JPS_HL:T, 1JTG_A:B, 1K5D_AB:C, 1KXP_A:D, 1KXQ_A:H, 1M10_A:B, 1MAH_A:F, 1NB5_AP:I, 1NCA_HL:N, 1NSN_HL:S, 1OC0_A:B, 1OPH_A:B, 1P2C_AB:C, 1PXV_A:C, 1R0R_E:I, 1RV6_VW:X, 1T6B_X:Y, 1UUG_A:B, 1VFB_AB:C, 1WDW_BD:A, 1WEJ_HL:F, 1YVB_A:I, 1ZLI_A:B, 2ABZ_B:E, 2B42_A:B, 2GOX_A:B, 2HRK_A:B, 2I25_N:L, 2I9B_E:A, 2J0T_A:D, 2JEL_HL:P, 2NYZ_AB:D, 2O3B_A:B, 2OUL_A:B, 2OZA_B:A, 2PTC_E:I, 2SIC_E:I, 2SNI_E:I, 2UUY_A:B, 2VDB_A:B, 2VIR_AB:C, 3BP8_AB:C, 3SGB_E:I, 1AVW_A:B, 1BQL_LH:Y, 1BTH_HL:P, 1CSE_E:I, 1FDL_HL:Y, 1FSS_A:B, 1HWG_A:C, 1IGC_HL:A, 1JHL_HL:A, 1PPF_E:I, 1STF_E:I, 1TBQ_JK:S, 1TEC_E:I, 1TPA_E:I, 1YQV_HL:Y, 2KAI_AB:I, 3HFM_HL:Y, 3HHR_A:C, 4HTC_HL:I, 4SGB_E:I, 4TPI_Z:I, 1BGX_HL:T, 1BKD_R:S, 1CGI_E:I, 1N8O_ABC:E, 1RRP_A:B, 1Y64_A:B, 2FD6_HL:U, 2SEC_E:I, 2TPI_ZI:S, 7CEI_A:B

Class: Low affinity (87 complexes)
PDB IDs with chains of the two interacting proteins: 1A2K_AB:C, 1AK4_A:D, 1AKJ_AB:DE, 1AVZ_B:C, 1B6C_A:B, 1BUH_A:B, 1BVK_DE:F, 1CBW_ABC:D, 1E4K_AB:C, 1E6E_A:B, 1E6J_HL:P, 1E96_A:B, 1EFN_A:B, 1EWY_A:C, 1F6M_A:C, 1FC2_C:D, 1FFW_A:B, 1FQJ_A:B, 1GCQ_B:C, 1GLA_G:F, 1GRN_A:B, 1H1V_A:G, 1H9D_A:B, 1HE8_A:B, 1I4D_AB:D, 1IB1_AB:E, 1IJK_BC:A, 1JMO_A:HL, 1JWH_CD:A, 1KAC_A:B, 1KKL_ABC:H, 1KLU_AB:D, 1KTZ_A:B, 1LFD_B:A, 1MLC_AB:E, 1MQ8_A:B, 1NVU_Q:S, 1NVU_R:S, 1NW9_B:A, 1PVH_A:B, 1QA9_A:B, 1R6Q_A:C, 1RLB_ABCD:E, 1S1Q_A:B, 1US7_A:B, 1WQ1_G:R, 1XD3_A:B, 1XQS_A:C, 1Z0K_A:B, 1ZHI_A:B, 1ZM4_A:B, 2A9K_A:B, 2AJF_A:E, 2AQ3_A:B, 2B4J_AB:C, 2BTF_A:P, 2C0L_A:B, 2FJU_B:A, 2HLE_A:B, 2HQS_A:H, 2MTA_HL:A, 2OOB_A:B, 2OOR_AB:C, 2PCB_A:B, 2PCC_A:B, 2TGP_I:Z, 2VIS_AB:C, 2WPT_B:A, 3BZD_A:B, 3CPH_G:A, 1A0O_A:B, 1DKG_AB:D, 1GUA_A:B, 1MDA_LH:A, 1MEL_M:B, 1NMB_HL:N, 1YCS_A:B, 1AZS_AB:C, 1DE4_AB:CF, 1EFU_A:B, 1FAK_HL:T, 1GHQ_A:B, 1GP2_A:BG, 1HE1_C:A, 1R8S_A:E, 1SBB_A:B, 1TMQ_A:B, 2OT3_B:A



Table II Performance of the present model generated using SMO with selected nine features

Dataset        Validation                Accuracy (%)  Sensitivity (%)  Specificity (%)  F-measure  Precision  AUC
185 complexes  As full training set      77.3          74.5             80.5             0.77       0.78       0.78
185 complexes  10-fold cross-validation  77.3          75.5             79.3             0.77       0.78       0.77
155 complexes  As full training set      76.8          76.8             76.7             0.77       0.77       0.77
155 complexes  10-fold cross-validation  76.1          75.6             76.7             0.76       0.76       0.76
155 complexes  3-fold cross-validation   76.8          74.4             79.5             0.77       0.77       0.77
30 complexes   Test set                  83.3          81.3             85.7             0.83       0.84       0.84
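The accuracy, sensitivity, and specificity columns of Table II follow from the counts of correctly and incorrectly classified complexes. The sketch below shows the arithmetic, assuming the high-affinity class is treated as positive. The function name `binary_metrics` and the confusion counts are illustrative assumptions; the counts are not reported in the paper and were chosen only because they are consistent with the percentages listed for the 30-complex test set.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) in percent.

    Positive class = high affinity, negative class = low affinity.
    """
    total = tp + fn + tn + fp
    accuracy = 100.0 * (tp + tn) / total
    sensitivity = 100.0 * tp / (tp + fn)  # high-affinity complexes recovered
    specificity = 100.0 * tn / (tn + fp)  # low-affinity complexes recovered
    return accuracy, sensitivity, specificity


# Hypothetical confusion counts (an assumption, not taken from the paper):
# 16 high-affinity and 14 low-affinity complexes in a 30-complex test set.
acc, sens, spec = binary_metrics(tp=13, fn=3, tn=12, fp=2)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
# prints "accuracy=83.33 sensitivity=81.25 specificity=85.71"
```

Rounded to one decimal place, these values agree with the test-set entries of 83.3, 81.3, and 85.7, although other class splits could in principle give similar percentages.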




Table III Comparison of different attribute selection methods, with SMO as the classifier and 10-fold cross-validation on 185 complexes

Selection method (attribute evaluator + search method)  Number of selected features  Accuracy (%)  Sensitivity (%)  Specificity (%)  F-measure  Precision  AUC
Random Forest*                                          24                           73.5          70.5             69.1             0.74       0.74       0.74
Infogain attribute evaluator + Ranker                   22                           74.1          74.5             73.6             0.74       0.74       0.74
Chi-squared attribute evaluator + Ranker                21                           75.7          73.5             78.2             0.76       0.76       0.76
ReliefF attribute evaluator + Ranker                    18                           73.0          69.4             77.0             0.73       0.73       0.73
Gain ratio attribute evaluator + Ranker                 17                           71.4          71.4             71.3             0.71       0.71       0.71
Consistency subset evaluator + Genetic search           13                           71.4          70.4             72.4             0.71       0.72       0.71
CFS subset evaluator + Bestfirst                        11                           70.8          72.4             69.0             0.71       0.71       0.71
Classifier subset evaluator + Genetic                   11                           72.4          72.5             72.4             0.73       0.73       0.72
SVM attribute evaluator + Ranker                        9                            77.3          75.5             79.3             0.77       0.78       0.77
CFS subset evaluator + Bestfirst                        9                            68.1          67.4             69.0             0.68       0.68       0.68
Chi-squared attribute evaluator + Ranker                9                            68.7          61.2             77.0             0.69       0.70       0.69
Classifier subset evaluator + Genetic                   9                            71.4          73.5             69.0             0.71       0.71       0.71
Consistency subset evaluator + Genetic search           9                            68.1          70.4             65.5             0.68       0.68       0.68
Gain ratio attribute evaluator + Ranker                 9                            70.3          63.3             78.7             0.70       0.71       0.71
Infogain attribute evaluator + Ranker                   9                            70.3          63.3             78.7             0.70       0.71       0.71
ReliefF attribute evaluator + Ranker                    9                            70.3          71.4             69.0             0.70       0.70       0.70

*Features ranked using the program available at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm (ref. 49).
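Several rows of Table III pair a per-attribute evaluator with the Ranker search: each feature is scored independently against the class label, the features are sorted by score, and the top-ranked ones are kept. The following is a minimal sketch of that idea, not the WEKA implementation used in the paper. The function names and the toy data are hypothetical, and the median split used to discretise each numeric feature is a simplifying assumption.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def info_gain(values, labels):
    """Information gain of splitting one numeric feature at its median."""
    ordered = sorted(values)
    threshold = ordered[len(ordered) // 2]
    low = [lab for v, lab in zip(values, labels) if v < threshold]
    high = [lab for v, lab in zip(values, labels) if v >= threshold]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (low, high) if part
    )
    return entropy(labels) - remainder


def rank_features(columns, labels, top_n):
    """Score every feature independently and return the top_n names."""
    ranked = sorted(
        columns, key=lambda name: info_gain(columns[name], labels), reverse=True
    )
    return ranked[:top_n]


# Toy data (hypothetical): one feature separates the classes, one does not.
labels = ["low", "low", "low", "high", "high", "high"]
columns = {
    "informative": [1, 2, 3, 10, 11, 12],
    "noise": [1, 10, 2, 11, 3, 12],
}
print(rank_features(columns, labels, top_n=1))
```

Because each feature is scored on its own, a ranking of this kind cannot account for redundancy between features, which is one reason subset evaluators such as CFS are listed in the table as alternatives.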




Table IV Performance of different classifiers on training set of 155 complexes using the selected nine features with 10-fold cross-validation

Method                         Accuracy (%)  Sensitivity (%)  Specificity (%)  F-measure  Precision  AUC
Bayesian Logistic Regression   61.9          82.9             38.4             0.60       0.63       0.61
Naive Bayes                    71.6          64.6             79.5             0.72       0.73       0.75
Multilayer Perceptron          67.7          68.3             67.1             0.68       0.68       0.69
SMO (Support vector machines)  76.1          75.6             76.7             0.76       0.76       0.76
IBK (K-nearest neighbors)      62.6          61.0             64.4             0.63       0.63       0.63
J48 decision tree              68.4          69.5             67.1             0.68       0.68       0.66
Random Forest                  65.8          75.6             54.8             0.65       0.66       0.68
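The classifiers in Table IV are compared under 10-fold cross-validation, in which every complex is held out exactly once while a model is trained on the remaining folds. The sketch below illustrates the procedure with class-stratified folds. The helper names, the round-robin fold assignment, and the `train_fn` / `predict_fn` callables are assumptions made for illustration and are not the WEKA routine used in the paper; the class sizes in the demo are those of Table I.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled members of each class round-robin across folds.
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % k].append(idx)
        offset += len(indices)
    return folds


def cross_validate(features, labels, train_fn, predict_fn, k=10, seed=0):
    """Return cross-validated accuracy in percent."""
    correct = 0
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train_fn(
            [features[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        correct += sum(predict_fn(model, features[i]) == labels[i] for i in held_out)
    return 100.0 * correct / len(labels)


# Demo with the class sizes of Table I (98 high, 87 low affinity).
labels = ["high"] * 98 + ["low"] * 87
folds = stratified_folds(labels, k=10)
print([len(fold) for fold in folds])
```

Any classifier can be plugged in through `train_fn` and `predict_fn`; a majority-class baseline, for example, gives a reference accuracy against which the values in the table can be read.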




Table V Importance of individual attributes in the selected feature set

S.No.  Attribute removed                                                                              Accuracy (%)  Sensitivity (%)  Specificity (%)  AUC
1      Weights for α-helix at the window position of -6                                               73.6          73.1             73.9             0.74
2      Weights for β-sheet at the window position of -6                                               73.6          73.1             73.9             0.74
3      Weights for β-sheet at the window position of -3                                               74.8          73.1             76.7             0.75
4      Weights for β-sheet at the window position of 5                                                74.2          72.0             76.7             0.74
5      Principal property value z2 (Side chain bulk)                                                  74.9          74.4             75.3             0.75
6      Number of predicted binding site residues in receptors                                         73.5          72.0             75.3             0.74
7      Number of aromatic and positively charged residues in predicted binding sites of receptors     74.8          74.4             75.3             0.75
8      Number of aromatic and positively charged residues in predicted binding sites of ligands       74.8          75.6             74.0             0.75
9      Percentage of aromatic and positively charged residues in predicted binding sites of ligands   71.6          74.4             68.5             0.71
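Table V reports the performance obtained when each attribute is removed in turn from the selected set, so a larger drop relative to the full feature set indicates a more important attribute. A minimal sketch of this leave-one-feature-out analysis follows. The function name, the `evaluate` callable, and the toy scoring function are hypothetical; in practice `evaluate` would retrain the classifier on the given features and return its cross-validated accuracy.

```python
def leave_one_feature_out(feature_names, evaluate):
    """Re-evaluate the model with each feature removed in turn.

    evaluate: callable taking a list of feature names, returning accuracy (%).
    Returns the baseline score and (name, score, drop) rows, largest drop first.
    """
    baseline = evaluate(list(feature_names))
    rows = []
    for name in feature_names:
        reduced = [f for f in feature_names if f != name]
        score = evaluate(reduced)
        rows.append((name, score, baseline - score))
    rows.sort(key=lambda row: row[2], reverse=True)
    return baseline, rows


# Toy scoring function (an assumption for illustration only): each feature
# contributes a fixed amount to a 60% starting accuracy.
contribution = {"f1": 5.0, "f2": 1.0, "f3": 2.5}


def toy_evaluate(names):
    return 60.0 + sum(contribution[n] for n in names)


baseline, rows = leave_one_feature_out(["f1", "f2", "f3"], toy_evaluate)
for name, score, drop in rows:
    print(name, score, drop)
```

In the toy example the feature with the largest contribution is listed first, mirroring how the attribute whose removal gives the lowest accuracy in Table V would be read as the most influential one.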




Figure 1: Analysis of selected features for their discriminative ability: (a) Number of aromatic and positively charged residues in predicted binding sites of ligands; (b) Weights for β-sheet at the window position of -6

