Journal of Structural Biology 189 (2015) 9–19
Improvement of protein binding sites prediction by selecting amino acid residues' features
Georgina Mirceva *, Andrea Kulakov
Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje, Skopje, Macedonia
Article history: Received 17 April 2014; received in revised form 30 August 2014; accepted 23 November 2014; available online 3 December 2014.
Keywords: Protein binding site; Protein function; Protein interaction; Feature selection technique (FST); Feature transformation technique
Abstract
One of the main focuses of the bioinformatics community is the study of the relationship between the structure of protein molecules and their functions. The literature offers various methods that consider different protein-derived information for predicting protein functions. In our research, we focus on predicting protein binding sites, which can be used to functionally annotate protein structures. In this paper we consider a set of sixteen amino acid residues' features and, by applying various feature selection techniques, we estimate their significance. Although the number of features in our case is not high, we perform feature selection in order to improve the predictive power and time complexity of the prediction models. The results show that by applying a proper feature selection technique, the predictive performance of the classification algorithms is improved, i.e., by considering the most relevant features we induce more accurate models than when we consider the entire set of features. Furthermore, the model complexity, as well as the training and testing times, are decreased by performing feature selection. We also compare our approach with several existing methods for protein binding sites prediction. The results demonstrate that the existing methods considered in this research are specific and applicable to the group of proteins for which the model was developed, while our approach is more generic and can be applied to a wider class of proteins.
© 2014 Elsevier Inc. All rights reserved.
1. Introduction
Protein molecules constitute an important part of the cells of living organisms due to their involvement in various essential cellular processes. Each protein has particular functions in a cell. High-throughput technologies provide a vast amount of data stored in protein databases. This data can be used to annotate proteins whose functions have not yet been discovered. Various methods can be used for annotating protein structures experimentally. However, these methods are labour intensive, expensive and time-consuming. Consequently, protein annotation cannot be performed at a speed comparable to that of the discovery of new protein structures. Since many research groups work on protein function prediction, it was necessary to define a standard for a unified representation of the knowledge about protein annotations, and thus the Gene Ontology (GO) (The Gene Ontology Consortium, 2008) was
* Corresponding author at: ul. "Rugjer Boshkovikj" 16, P.O. Box 393, 1000 Skopje, Macedonia. Fax: +389 2 3088 222.
E-mail addresses: georgina.mirceva@finki.ukim.mk (G. Mirceva), andrea.kulakov@finki.ukim.mk (A. Kulakov).
http://dx.doi.org/10.1016/j.jsb.2014.11.007
1047-8477/© 2014 Elsevier Inc. All rights reserved.
introduced. GO is a controlled and structured vocabulary of protein annotation terms, which are divided into three groups: molecular function, biological process and cellular component. For each annotation, an evidence code is stored that indicates the manner in which the annotation was discovered. In Du Plessis et al. (2011), an analysis of the evidence codes from GO is performed. This analysis shows that as of April 2010, 98.08% of the annotations were computationally discovered and not curated, 0.7% were computationally inferred and curated, while only 0.61% of the annotations were experimentally discovered. This fact makes the importance of computational methods for predicting protein functions evident. The literature offers a range of computational methods for annotating protein structures, which we categorize into six main groups. The first group of methods inspects homology in protein sequences, since homologous proteins are more likely to share common functions. Therefore, comparison of a query sequence with known protein sequences (Altschul et al., 1990) can be performed in order to identify homologous proteins. Nevertheless, some newly discovered sequences will not have a homolog among the known proteins, so other approaches are often required. The second group of methods (Sigrist et al., 2010) annotates protein
structures based on signatures (motifs) found in their sequences. The third group of methods determines protein annotations based on structural similarity. Protein structures are more conserved than sequences, so structurally similar proteins are more likely to have similar functions. There are many methods for protein structure retrieval, such as those proposed in Holm and Sander (1993), Shindyalov and Bourne (1998), and Ye and Godzik (2004). The fourth group of methods annotates protein structures by detecting protein binding sites (Tuncbag et al., 2009) based on the amino acid residues' features. A recent publication (Lu et al., 2013) presents the most widely used features for binding sites prediction and reviews the latest methods in this area. The fifth group of methods (Panchenko et al., 2004) annotates protein structures based on conserved parts of the sequences/structures that do not change throughout evolution. The Protein Interactions by Structural Matching (PRISM) method (Keskin et al., 2008) considers both sequence and structure conservation to identify the binding sites of the template structures. Then, the binding sites of the query are determined by structural matching with the template structures. The sixth group of methods annotates protein structures using protein–protein interaction networks (Sharan et al., 2007). In our research we focus on developing methods for protein binding site prediction. Various methods are used for this purpose, such as distance-based methods (Mihel et al., 2008; Ofran and Rost, 2003; PRINT, 2013), methods that examine sequence and/or structure conservation (Aytuna et al., 2005; Capra and Singh, 2007; Jones and Thornton, 1997), and methods based on identifying pockets (An et al., 2005; Hendlich et al., 1997; Laskowski, 1995). There are also methods that combine several types of information.
For example, ConCavity (Capra et al., 2009) considers both sequence conservation and 3D structure to make more accurate predictions. Accessible Surface Area (ASA) (Shrake and Rupley, 1973), Relative ASA (RASA), depth index (DPX) (Pintar et al., 2003), protrusion index (CX) (Pintar et al., 2002) and hydrophobicity (Kyte and Doolittle, 1982) are the most widely used amino acid residues' features. Since an amino acid residue consists of several atoms, we can extract several features regarding ASA and RASA by summing these characteristics over different sets of atoms (all atoms, backbone atoms, side-chain atoms, polar atoms or non-polar atoms), as in Mihel et al. (2008). Also, the DPX and CX of an amino acid residue can be calculated as the average, maximum or minimum of the DPXs or CXs of the atoms that constitute the residue (Mihel et al., 2008). In Mirceva and Kulakov (2012a,b), the models for protein binding sites prediction are induced by considering only total ASA, average DPX, average CX and hydrophobicity. Besides these four features, we can consider additional features of the amino acid residues, and then select the most valuable set of features using an appropriate technique. Although the number of features is not very high, the number of samples (amino acid residues) is huge, so reducing the dimensionality of the dataset is an important issue. By applying appropriate feature selection and transformation techniques, besides decreasing the model's complexity and reducing training and testing times, the prediction power of the models can also be increased due to the elimination of irrelevant features. Furthermore, models with lower complexity are more easily interpretable. Among the feature transformation techniques, we apply Principal Component Analysis (PCA) (Abdi and Williams, 2010; Pearson, 1901) to transform the original features into new ones, and then we perform reduction.
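The PCA-based transformation and reduction just described can be illustrated with a minimal sketch; the matrix dimensions and the use of NumPy's SVD are our own illustrative choices, not the authors' actual implementation:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the feature matrix X (samples x features) onto its top
    principal components, keeping n_components transformed features."""
    Xc = X - X.mean(axis=0)                    # center each feature
    # SVD of the centered data: rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T            # coordinates in reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))                 # e.g. 16 residue features
Z = pca_reduce(X, n_components=5)
print(Z.shape)                                 # (100, 5)
```

The reduced matrix `Z` would then replace the original features as classifier input; the number of retained components is a tunable choice.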
Feature selection techniques (FSTs) can generally be divided into filter and wrapper techniques (Kohavi and John, 1997). The former are not related to the model induction method and do not optimize the final criterion, while in wrapper
techniques the induction method is used to build models using different subsets of features, and the optimal subset is chosen to maximize the final criterion. There are also embedded schemes, where the induction algorithm has FSTs built in. In this research we focus on the filter and wrapper techniques. From the filter FSTs, we consider several techniques that independently rank the features (Hunt et al., 1966; Kira and Rendell, 1992; Liu and Setiono, 1995; Quinlan, 1993), and then we take the features with the highest ranks. However, taking the top n features does not mean that the optimal subset of n features is chosen. Namely, the top-ranked features may have high dependency on the class attribute, but they may also have high dependency among themselves. Therefore, we can use techniques that evaluate subsets of features. We identify three categories of such techniques: exponential, sequential and randomized. We consider exhaustive search, which is an exponential method. This methodology is time intensive, but it can be used when the number of features is not high, as in our case. In order to avoid examining all subsets of features, a heuristic approach (Pearl, 1984) can be used. We use several sequential techniques that add or remove features one at a time, but these can get stuck in a local minimum. Using randomization, as with genetic algorithms (Goldberg, 1989), we can escape local minima. Another popular method is the minimal-Redundancy-Maximal-Relevance (mRMR) technique (Peng et al., 2005), which was recently used for feature selection in protein disulphide bond prediction (Niu et al., 2013) and protein–protein interaction prediction (Liu et al., 2013). From the filter techniques, we also use the technique given in Guyon et al. (2002). In this paper, we induce models for protein binding sites prediction in three steps. In the first step, we extract several features for each amino acid residue.
In the second step, a range of FSTs is used to identify the most valuable features. Finally, in the third step, prediction models are induced using several classification methods (Freund and Mason, 1999; Friedman et al., 1997; Gama, 2004; Huang et al., 2008; John and Langley, 1995; Kohavi, 1996; Quinlan, 1993; Senge and Hüllermeier, 2011). By applying appropriate FSTs, we expect the prediction models to improve. We also compare our approach with several existing methods for protein binding site prediction, proposed in An et al. (2005), Aytuna et al. (2005), Capra et al. (2009), Capra and Singh (2007), Hendlich et al. (1997), Jones and Thornton (1997), Keskin et al. (2008), Laskowski (1995), Mihel et al. (2008), Ofran and Rost (2003), and PRINT (2013). The rest of this paper is organized as follows. In Section 2, we present our approach for protein binding sites prediction. Section 3 presents the results of the evaluation of our approach and compares it with several existing methods for protein binding sites prediction. Finally, in Section 4 we conclude the paper and point out possibilities for further improvements.
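The filter-style, classifier-independent ranking discussed above can be sketched as follows. This is not one of the cited rankers; it is a minimal stand-in that scores each feature by its absolute correlation with the binary class label, purely to illustrate ranking followed by selection of the top n features:

```python
import numpy as np

def rank_features(X, y):
    """Generic filter ranking: score each feature independently by the
    absolute Pearson correlation with the binary class label y, and
    return feature indices ordered best-first."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = np.abs(Xc.T @ yc) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(scores)[::-1]

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200).astype(float)   # binding / non-binding labels
X = rng.normal(size=(200, 4))                    # four synthetic features
X[:, 2] += 2.0 * y                               # make feature 2 informative
order = rank_features(X, y)
top_n = order[:2]                                # keep the top-ranked features
```

As the text notes, such independent ranking ignores redundancy among the selected features, which is why subset-evaluating techniques are also considered.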
2. Materials and methods
2.1. Extraction of the amino acid residues' features
Protein chains are built from amino acid residues, each of which contains several atoms. First, we extract several features for each atom, and then we calculate the features of each residue from the features of its atoms. We consider the following features: Accessible Surface Area (ASA) (Shrake and Rupley, 1973), Relative ASA (RASA), depth index (DPX) (Pintar et al., 2003), protrusion index (CX) (Pintar et al., 2002) and hydrophobicity (Kyte and Doolittle, 1982). We use the Protein Structure and Interaction Analyzer (PSAIA) software (Mihel et al., 2008) to extract these features of the amino acid residues.
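Assembled per residue, the feature families listed in this subsection amount to the sixteen features considered in this paper (five ASA variants, five RASA variants, two DPX and three CX aggregates, and hydrophobicity). A sketch of the resulting feature record follows; the names are ours and need not match PSAIA's actual output labels:

```python
# Hypothetical feature names; PSAIA's real column labels may differ.
RESIDUE_FEATURES = [
    "total_ASA", "main_chain_ASA", "side_chain_ASA", "polar_ASA", "nonpolar_ASA",
    "total_RASA", "main_chain_RASA", "side_chain_RASA", "polar_RASA", "nonpolar_RASA",
    "avg_DPX", "max_DPX",          # min DPX is zero for surface residues, so dropped
    "avg_CX", "min_CX", "max_CX",
    "hydrophobicity",
]
assert len(RESIDUE_FEATURES) == 16   # the sixteen features used in the paper
```

Each surface residue thus maps to one 16-dimensional feature vector, which is the unit of classification throughout the paper.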
ASA was introduced in Lee and Richards (1971) and is calculated using the rolling ball algorithm (Shrake and Rupley, 1973), where a probe sphere with a predefined radius (typically 1.4 Å) is used to estimate the surface of an atom that is accessible to the probe sphere. For each amino acid residue we calculate its total ASA, main-chain ASA, side-chain ASA, polar ASA and non-polar ASA by summing ASA over various sets of atoms (Mihel et al., 2008). Many residues are hidden in the protein interior and cannot be reached by the probe sphere. Such residues cannot be part of a binding site, so we retain only the surface amino acid residues. We consider a given residue to be at the protein surface if at least 5% of its surface is accessible to the probe sphere (Chothia, 1976). Different amino acids have different numbers of atoms, so ASA is inherently higher for some amino acids. Therefore, RASA can be used, which is the ratio between the ASA of a residue and the standard ASA (Hubbard and Thornton, 1993) of the corresponding amino acid. As for ASA, we extract the total RASA, main-chain RASA, side-chain RASA, polar RASA and non-polar RASA (Mihel et al., 2008). The DPX (Pintar et al., 2003) of an atom shows how far the atom is from the nearest atom that is accessible to the probe sphere. The depth index of atoms that are accessible to the probe sphere is zero, and it is greater than zero for all other atoms. CX was introduced in Pintar et al. (2002), and it indicates the density of the region where a given atom is located. The number of non-hydrogen atoms Natom within a sphere of radius R = 10 Å around the atom is calculated. The volume around the atom occupied by the protein is estimated as Vint = Natom * Vatom, where Vatom is the mean volume of an atom (20.1 Å³). The difference between the entire volume of the sphere and the volume occupied by the protein is denoted as Vext. Finally, the protrusion index is calculated as Vext/Vint.
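The CX formula just described, together with residue-level aggregation over the atoms of a residue, can be sketched as follows (the function names are ours; the constants are those given in the text):

```python
import numpy as np

MEAN_ATOM_VOLUME = 20.1   # Vatom, mean volume of a non-hydrogen atom (cubic angstroms)
SPHERE_RADIUS = 10.0      # R, radius of the sphere around the atom (angstroms)

def protrusion_index(coords, i):
    """CX of atom i: Vext/Vint inside a sphere of radius R around the
    atom. coords holds the non-hydrogen atom coordinates (N x 3)."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    n_atom = int((d <= SPHERE_RADIUS).sum())       # Natom, includes atom i itself
    v_int = n_atom * MEAN_ATOM_VOLUME              # volume occupied by the protein
    v_sphere = 4.0 / 3.0 * np.pi * SPHERE_RADIUS ** 3
    v_ext = v_sphere - v_int                       # unoccupied volume
    return v_ext / v_int

def residue_cx(coords, atom_idx):
    """Residue-level CX features: average, minimum and maximum over atoms."""
    vals = [protrusion_index(coords, i) for i in atom_idx]
    return float(np.mean(vals)), float(np.min(vals)), float(np.max(vals))

# An isolated atom: the sphere contains only the atom itself, so CX is large.
cx = protrusion_index(np.zeros((1, 3)), 0)
```

Dense (buried) regions yield many atoms in the sphere and hence a small CX, while protruding surface atoms yield a large CX, matching the intuition in the text.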
For each amino acid residue we calculate the average, minimum and maximum of the DPXs and CXs of its atoms. The minimum DPX is zero for all surface residues, so we discard this feature. Hydrophobicity (Kyte and Doolittle, 1982) indicates the hydrophobic properties of the amino acids. Namely, hydrophobic amino acids are more commonly found in the protein interior, while hydrophilic amino acids are typically located near the protein surface. We consider the hydrophobicity scale introduced by Kyte and Doolittle (1982).
2.2. Dataset description
As a standard of truth, we use the Biomolecular Interaction Network Database (BIND) (Bader et al., 2001), which holds knowledge about experimentally determined protein binding sites. Since the number of known protein structures is huge and there is large redundancy among them, usually only representative protein chains with low sequence similarity are considered. The test dataset is formed from the residues of 3530 chains that have less than 10% sequence similarity between themselves, using the criterion given in Chandonia et al. (2004). Using the same criterion, the training dataset is formed from the residues of 633 chains that were not used in forming the test dataset and have less than 20% sequence similarity. Next, we filter the surface residues as described above, and thus obtain 115,579 residues in the training dataset and 625,939 in the test dataset. Of the training residues, only 15,696 (around 13.58%) are part of binding sites according to BIND (Bader et al., 2001), meaning that the non-binding-sites class is dominant. In order to prevent inducing models that are biased toward the dominant class, we balance the training dataset by down-sampling it to 27% of its original size without replacement of the samples, following a uniform distribution of the class attribute. The test dataset, named B3530, remains unbalanced, thus for evaluation
purposes an appropriate evaluation measure should be used. After balancing, the features are normalized to the interval [0, 1]. The test dataset B3530 is used to evaluate our approach using various FSTs combined with various classifiers. In Table 1 we provide summary statistics of the datasets used in this research. We compare our approach with the methods proposed in An et al. (2005), Aytuna et al. (2005), Capra et al. (2009), Capra and Singh (2007), Hendlich et al. (1997), Jones and Thornton (1997), Keskin et al. (2008), Laskowski (1995), Mihel et al. (2008), Ofran and Rost (2003), and PRINT (2013). For evaluation of the methods given in Aytuna et al. (2005), Jones and Thornton (1997), Mihel et al. (2008), and Ofran and Rost (2003), we use the PSAIA software (Mihel et al., 2008), while for the other methods we use the pre-calculated predictions available at the PRISM, PRINT and ConCavity websites. We use two datasets in the comparison. The B1549 test dataset is formed by considering the chains from B3530 for which we obtained predictions from the methods given in Aytuna et al. (2005), Capra et al. (2009), Capra and Singh (2007), Jones and Thornton (1997), Keskin et al. (2008), Mihel et al. (2008), Ofran and Rost (2003), and PRINT (2013). The other dataset used in the comparison includes the knowledge stored in the LigASite v7.0 database (Dessailly et al., 2008), which contains biologically relevant binding sites in proteins with known apo-structures. This database contains both the redundant and non-redundant (