Prediction of FMN-binding residues with three-dimensional probability distributions of interacting atoms on protein surfaces.

Journal of Theoretical Biology ∎ (∎∎∎∎) ∎∎∎–∎∎∎


Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/yjtbi

Rajasekaran Mahalingam a,*, Hung-Pin Peng a,b,c, An-Suei Yang a,**

a Genomics Research Center, Academia Sinica, 128 Academia Rd., Sec. 2, Nankang Dist., Taipei 115, Taiwan
b Institute of Biomedical Informatics, National Yang-Ming University, Taipei 11221, Taiwan
c Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei 115, Taiwan

HIGHLIGHTS

First structure-based approach for prediction of protein–FMN interaction.
Does not require evolutionary information for the prediction.
Useful in annotating protein structures of unknown function and computational protein models.

ARTICLE INFO

Article history: Received 26 July 2013; received in revised form 29 October 2013; accepted 30 October 2013

Keywords: Structure-based; Computational method; Machine learning; Drug discovery; Functional annotation

ABSTRACT

Flavin mono-nucleotide (FMN) is a cofactor involved in many biological reactions. Insights into protein–FMN interactions aid protein functional annotation and facilitate drug design. In this study, we have established a new method to predict FMN-binding sites on protein surfaces, making use of an encoding scheme of the three-dimensional probability density maps that describe the distributions of 40 non-covalent interacting atom types around protein surfaces. One machine learning model was trained for each of the 30 protein atom types to predict tentative FMN-binding sites on protein structures. The method's capability was evaluated by five-fold cross-validation on a dataset containing 81 non-redundant FMN-binding protein structures and further tested on independent datasets of 30 and 15 non-redundant protein structures, respectively. These predictions achieved an accuracy of 0.94, 0.94 and 0.96 with a Matthews correlation coefficient (MCC) of 0.53, 0.53 and 0.65, respectively, for the three protein structure sets. The prediction capability is superior to that of the existing method. This is the first structure-based approach that does not rely on evolutionary information for predicting FMN-interacting residues. The webserver for the prediction is available at http://ismblab.genomics.sinica.edu.tw/. © 2013 Elsevier Ltd. All rights reserved.

* Corresponding author. Current address: Department of Physiology and Biophysics, School of Medicine, Case Western Reserve University, 10900 Euclid Ave., Cleveland, OH 44106, USA. Tel.: +1 216 368 8654.
** Corresponding author. Tel.: +886 2 2787 1232.
E-mail addresses: [email protected] (R. Mahalingam), [email protected] (A.-S. Yang).

0022-5193/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.jtbi.2013.10.020

1. Introduction

FMN is an essential cofactor in flavoproteins, which are involved in (i) redox reactions in the energy-producing metabolic pathways and (ii) non-redox reactions in which FMN acts as an acid or base in covalent-intermediate formation (Mansoorabadi et al., 2007; Serrano et al., 2012). Flavodoxin, one of the flavoproteins, is considered a potential drug target against microbial infections because it plays a critical role in the electron transfer system of pathogenic bacteria but not in mammals. Helicobacter pylori flavodoxin acts as an electron acceptor in the pyruvate metabolic pathway, and thus inhibition of this protein can affect bacterial survival (Cremades et al., 2005). Chorismate synthase is another FMN-binding protein, involved in the shikimate pathway (Macheroux et al., 1999), and is considered a primary target in developing antibacterial therapeutics against tuberculosis (Fernandes et al., 2007). Hence, identification of FMN-binding proteins and binding site residues can aid the drug discovery process for antimicrobial therapeutics.

A computational method for predicting the FMN-binding residues on proteins would greatly facilitate defining FMN-binding sites on protein structures. Computational methods have been developed to predict FMN (Wang et al., 2012), flavin adenine dinucleotide (FAD) (Mishra and Raghava, 2010) and nicotinamide adenine dinucleotide (NAD) (Ansari and Raghava, 2010) binding sites. These computational methods are reasonably successful in their respective predictions. Nevertheless, they are all sequence-based predictors relying on evolutionary information. Consequently, these methods may have difficulty in predicting binding sites in orphan proteins, which share very low sequence similarity with existing proteins. A structure-based method which does not rely on evolutionary information has yet to be developed.

In this study, we have developed such a structure-based method for the prediction of FMN-interacting residues on protein surfaces. This method uses a machine learning approach to predict FMN-binding sites on protein surfaces by recognizing characteristic interacting atom distribution patterns associated with FMN binding. The basic principle has already been applied successfully to predict protein–protein (Chen et al., 2012) and protein–carbohydrate (Tsai et al., 2012) interactions. Here we extend this approach to predicting FMN-interacting residues. In the prediction, protein surface atoms were first categorized into 30 atom types and one machine learning model was trained for each of the atom types. The input attributes for the machine learning algorithm were normalized distance-weighted sums of three-dimensional probability density maps (PDMs) of 40 interacting atom types (30 atom types from protein, one from water and nine from FMN) on the protein surfaces. The PDMs around the query protein atoms for the protein interacting atom types and water have been described in previous publications (Chen et al., 2012; Tsai et al., 2012); the PDMs for the nine FMN interacting atom types were constructed with the protein–FMN interacting atom pairs from the dataset of 192 FMN–protein complex structures. The machine learning algorithm learned the patterns of the attributes to distinguish the binding atoms from the non-binding atoms on the protein surfaces.

We evaluated the performance of our predictor on the training dataset P81 as well as on the independent test sets P30 and P15 (Wang et al., 2012). The results indicate that our approach outperforms the existing method for predicting FMN-binding sites on protein structures.

2. Materials and methods

Table 1
Protein and FMN atom types.

ID #  Atom type  Radius (Å)  Description
1     NH1        1.65        Backbone NH
2     C          1.76        Backbone C
3     CH1E       1.87        Backbone CA (exc. Gly)
4     O          1.40        Backbone O
5     CH0        1.76        Arg CZ, Asn CG, Asp CG, Gln CD, Glu CD
6     CH1S       1.87        Sidechain CH1: Ile CB, Leu CG, Thr CB, Val CB
7     CH2E       1.87        Tetrahedral CH2 (except CH2P, CH2G) All CB
8     CH3E       1.87        Tetrahedral CH3
9     CR1E       1.76        Aromatic CH (except CR1W, CRHH, CR1H)
10    OH1        1.40        Alcohol OH (Ser OG, Thr OG1, Tyr OH)
11    OC         1.40        Carboxyl O (Asp OD1, OD2, Glu OE1, OE2)
12    OS         1.40        Sidechain O: Asn OD1, Gln OE1
13    CH2G       1.87        Gly CA
14    CH2P       1.87        Pro CB, CG, CD
15    NH1S       1.65        Sidechain NH: Arg NE, His ND1, NE1, Trp NE1
16    NC2        1.65        Arg NH1, NH2
17    NH2        1.65        Asn ND2, Gln NE2
18    CR1W       1.76        Trp CZ2, CH2
19    CY2        1.76        Tyr CZ
20    SC         1.85        Cys S
21    CF         1.76        Phe CG
22    SM         1.85        Met S
23    CY         1.76        Tyr CG
24    CW         1.76        Trp CD2, CE2
25    CRHH       1.76        His CE1
26    NH3        1.50        Lys NZ
27    CR1H       1.76        His CD2
28    C5         1.76        His CG
29    N          1.65        Pro N
30    C5W        1.76        Trp CG
31    HOH        1.40        Water
32    P          2.10        Phosphate
33    O1O2       1.68        Phosphate oxygen
34    N1N9       1.82        Base
35    N          1.82        ring
36    C          1.90        ring
37    N          1.82        link
38    O          1.68        link
39    C.3        1.90        Sp3 carbon
40    O.3        1.68        Sp3 oxygen

The protein atom types 1–31 have been previously defined by Laskowski et al. (1996) with minor modifications. The atom types 32–40 were defined in this work for FMN molecule.

2.1. Dataset

The training and test datasets except P15 were obtained from Wang et al. (2012). These authors obtained 111 protein chains from the PDB (Berman et al., 2002), randomly selected 30 proteins (P30) for the independent test, and used the remaining 81 protein chains (P81) as a training set. For P15, we extracted protein–FMN complexes from the PDB and then used the PISCES program (Wang and Dunbrack, 2003) to remove structures with more than 10% sequence identity to the P81 and P30 datasets, which finally yielded 15 protein chains.

2.2. Construction of three-dimensional probability density maps on protein surfaces

The detailed method for PDM construction has been described previously (Chen et al., 2012; Tsai et al., 2012; Yu et al., 2012). In brief, the interacting atom types from protein, water, and FMN are shown in Table 1. The PDMs for these interacting atom types were constructed with interacting atoms retrieved from the interacting atom database described previously (Chen et al., 2012; Tsai et al., 2012; Yu et al., 2012). The interacting atom database for protein–FMN interacting atom pairs was constructed with the dataset of 192 protein–FMN complexes.

2.3. PDM-based attributes as inputs for machine learning algorithms

Protein surface atoms were categorized into 30 protein atom types (Table 1, types 1–30), and one machine learning model was trained for each of the atom types. The input attributes for each protein atom i (a_{i,j}, j = 1, ..., 41: 40 attributes from the 40 interacting atom type PDMs plus one attribute from geometry) for each of the machine learning models were calculated from the PDMs on the protein surface and from the geometry of the protein surface as follows. For each atom i on the surface of the query protein (solvent accessible surface area of atom i > 0), the PDM values associated with the grids within a 5 Å radius centered at the atom are summed:

\[ S_{i,j} = \sum_{k:\; r_{i,k} \le 5\,\text{Å}} g_{k,j} \tag{1} \]

where S_{i,j} is the PDM sum for interacting atom type j at atom i; r_{i,k} is the distance between atom i and grid point k; and g_{k,j} is the PDM value of interacting atom type j at grid point k. A_{i,j} (j = 1, ..., 40) associated with each atom i was calculated with the following equation:

\[ A_{i,j} = S_{i,j} + \frac{\sum_{k:\; d_{i,k} \le 10\,\text{Å}} S_{k,j}\, d_{i,k}^{-2}}{\sum_{n:\; d_{i,n} \le 10\,\text{Å}} d_{i,n}^{-2}} \tag{2} \]

where S_{i,j} is defined in Eq. (1) and d_{i,k} is the distance between atom i and atom k. The attribute set (a_{i,j}, j = 1, ..., 40) for the machine learning models on atom i was derived from A_{i,j} (j = 1, ..., 40) with the following scaling scheme:

\[ a_{i,j} = \begin{cases} 1, & A_{i,j} > M_{\max,j} \\ 0, & A_{i,j} < M_{\min,j} \\ \dfrac{A_{i,j} - M_{\min,j}}{M_{\max,j} - M_{\min,j}}, & \text{otherwise} \end{cases} \tag{3} \]

where M_{max,j} is the median of the distribution of the maximal A_{i,j} from each of the proteins in P81 and M_{min,j} is the median of the distribution of the minimal A_{i,j} of the proteins in P81. The a_{i,j} (j = 1–40) are the first 40 attributes for machine learning, and the 41st attribute (geometry) for atom i was the fraction of the space not occupied by the van der Waals volume of the protein in the 10 Å sphere centered at atom i. This attribute was also scaled between 0 and 1 as in Eq. (3).

2.6. Performance measure

The predictor's performance is evaluated by the following measures: accuracy (Acc), precision (Pre), sensitivity (Sen), specificity (Spe), F-score and the Matthews correlation coefficient (MCC).

\[ \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4} \]

\[ \mathrm{Pre} = \frac{TP}{TP + FP} \tag{5} \]

\[ \mathrm{Sen} = \frac{TP}{TP + FN} \tag{6} \]

\[ \mathrm{Spe} = \frac{TN}{TN + FP} \tag{7} \]

\[ F\text{-score} = \frac{2 \cdot \mathrm{Pre} \cdot \mathrm{Sen}}{\mathrm{Pre} + \mathrm{Sen}} \tag{8} \]

\[ \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{9} \]
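As an illustrative sketch (not the authors' code), the measures in Eqs. (4)–(9) can be evaluated from the four confusion counts, where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives. The function name `binding_metrics` and the example counts are hypothetical, and reporting a measure as 0.0 when its denominator is zero is an assumption of this sketch.

```python
import math

def binding_metrics(tp, tn, fp, fn):
    """Evaluate Eqs. (4)-(9) from confusion counts (hypothetical helper)."""
    def ratio(num, den):
        # Assumption: a measure with a zero denominator is reported as 0.0.
        return num / den if den else 0.0

    pre = ratio(tp, tp + fp)
    sen = ratio(tp, tp + fn)
    return {
        "Acc": ratio(tp + tn, tp + tn + fp + fn),
        "Pre": pre,
        "Sen": sen,
        "Spe": ratio(tn, tn + fp),
        "Fsc": ratio(2 * pre * sen, pre + sen),
        "MCC": ratio(tp * tn - fp * fn,
                     math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }

# Made-up counts for illustration only (not taken from the paper).
example = binding_metrics(tp=50, tn=900, fp=30, fn=20)
```

Because non-binding residues dominate, accuracy stays high even for modest recall, which is why the MCC is the more informative summary in this setting.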

2.4. FMN-interacting site prediction with artificial neural network

One machine learning model was trained for each of the 30 protein atom types with the positive and negative cases found in P81. A positive case was a true binding atom. For each of the 30 atom types, artificial neural network (ANN) predictors were trained and validated. The detailed methodology of the ANN has been described previously (Chen et al., 2012; Tsai et al., 2012). The input layer consisted of 41 nodes, for which the input attributes are described in Eqs. (1)–(3). The hidden layer had 84 nodes, twice the sum of the input and output nodes. The output layer had a single node with an activity value between 0 and 1, matching the negative (0) and positive (1) cases, respectively. The learning rate for both the hidden layer and the output layer was 0.01 and the momentum was 0.1. Training was stopped when the mean absolute error between the ANN output values and the target values converged. The parameter set and the architecture of the ANN were determined empirically for optimal performance.

Since non-binding atoms in the training set greatly outnumbered binding atoms, ordinary machine learning algorithms would produce learning biases without suitable treatment. The methodology therefore included multiple predictors to produce an ensemble of prediction results (Breiman, 1996). Each individual classifier in the predictor ensemble was trained with a different sampling (bag) of the training set, and the final prediction was calculated by averaging with equal weight the output values from the predictors (Manning et al., 2007). In each bag, all of the positive cases were included, along with randomly sampled negative cases that were 1.5 times as many as the positive cases. The bag number was set to four, which balanced the need for effectiveness and training efficiency. All four bags were used to train ANN models. Each ANN model was trained for 1000 iterations.
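The bagging scheme described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names (`make_bags`, `ensemble_score`), the random seed and the toy predictors are hypothetical, and drawing the negatives without replacement within each bag is an assumption that the text does not specify.

```python
import random

def make_bags(positives, negatives, n_bags=4, neg_ratio=1.5, seed=0):
    """Build training bags: all positives plus randomly sampled negatives.

    Assumption: negatives are drawn without replacement within a bag.
    """
    rng = random.Random(seed)
    n_neg = min(len(negatives), int(round(neg_ratio * len(positives))))
    bags = []
    for _ in range(n_bags):
        sampled = rng.sample(negatives, n_neg)
        bags.append(list(positives) + sampled)
    return bags

def ensemble_score(predictors, attributes):
    """Average the outputs of the bag-specific predictors with equal weight."""
    outputs = [predict(attributes) for predict in predictors]
    return sum(outputs) / len(outputs)

# Toy data: 4 binding atoms and 40 non-binding atoms of one atom type.
positives = [("pos", i) for i in range(4)]
negatives = [("neg", i) for i in range(40)]
bags = make_bags(positives, negatives)

# Stand-ins for four trained ANN models, each returning a value in [0, 1].
toy_predictors = [lambda a: 0.2, lambda a: 0.4, lambda a: 0.6, lambda a: 0.8]
score = ensemble_score(toy_predictors, attributes=None)
```

Each bag here holds 4 positives and 6 negatives, so every model sees a roughly balanced training set while the ensemble as a whole still samples the negative class broadly.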
During training, the model was tested on the validation set after every 10 training iterations. The number of training iterations that yielded the best MCC on the validation set was used to determine the parameters for the predictors.

2.5. Five-fold cross-validation

The prediction was evaluated by five-fold cross-validation. The whole P81 dataset was randomly divided into five sets of approximately equal size; three of these sets were used for training, one for validation and the remaining one for testing. The final performance is the average of the results over all five test sets.

In the performance measures of Eqs. (4)–(9), TP, FN, FP and TN are the numbers of true positive, false negative, false positive and true negative residues in the prediction, respectively. Sensitivity (also known as recall) can be viewed as a measurement of completeness, whereas precision is a measurement of exactness or fidelity. MCC is a measurement of the quality of two-class classifications (positive and negative). Its value ranges between −1 and 1; random correlation gives an MCC of 0 while perfect correlation yields an MCC of 1.

2.7. Prediction based on confidence level

The ANN output values ranging from 0 to 1 were normalized to prediction confidence levels, on the basis of which tentative FMN-binding patches were produced on the protein surface. From the validation set, the machine learning model outputs for the 30 protein atom types were sorted into bins of interval 0.1. The confidence level of each bin was calculated as the fraction of true positives over the total number of predictions in the bin. Finally, lookup tables were constructed based on the output–confidence relationships, and the outputs from the machine learning models were converted to prediction confidence levels with these lookup tables.

2.8. Prediction of patches of atoms as protein–FMN binding sites

An FMN-binding site was predicted as a cluster of surface atoms predicted as positive cases with high prediction confidence level. Protein surface atoms with prediction confidence level greater than 50% were used as cluster centers (seeds) to include neighboring surface atoms within a radius of 9 Å. Within each of the surface patches, all surface atoms with a confidence level for positive prediction greater than 30% were included in the tentative patch of atoms as an FMN-binding site. If the pairwise distance of any two seeds was within 7 Å, the two corresponding patches were merged into one patch. The parameters were optimized for residue-based prediction accuracy with the validation set.

2.9. Residue-based predictions for the FMN-binding sites

To convert the atom-based binding site prediction to a residue-based one, we applied a heuristic procedure: only the residues with any surface atoms included in the atom-based binding patch were considered as positive residues for the residue-based patch. Similarly, actual binding sites for the protein–FMN complex at the residue level were defined by patches of positive residues, that is, residues with any surface atom within 5.0 Å of any FMN atom. This definition enabled the comparison of prediction results with actual binding sites at the residue level. The percentage parameter was optimized for residue-based prediction accuracy with the validation set.

2.10. Mann–Whitney U-test

The Mann–Whitney U-test is a non-parametric statistical method to test whether two groups of numerical values come from identical continuous distributions of equal medians; an increasing p-value indicates a decreasing difference between the two distributions, and a p-value of 1 indicates that the two distributions are statistically indistinguishable. The Mann–Whitney U-tests were carried out with the statistical tool ranksum in MATLAB (http://www.mathworks.com/help/toolbox/stats/ranksum.html).
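The tests in this work were run with MATLAB's ranksum. Purely to illustrate the idea, the following pure-Python sketch computes a two-sided Mann–Whitney U-test with the normal approximation. It is not the authors' code: the function name `mann_whitney_u` and the toy values are hypothetical, and the sketch omits the tie and continuity corrections, so its p-values may differ slightly from those of ranksum for small or heavily tied samples.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U-test via the normal approximation.

    Simplifications (assumptions of this sketch): average ranks for ties,
    but no tie correction of the variance and no continuity correction.
    """
    pooled = sorted((value, group) for group, data in enumerate((x, y))
                    for value in data)
    # Assign 1-based ranks, averaging ranks over runs of tied values.
    ranks = [0.0] * len(pooled)
    start = 0
    while start < len(pooled):
        end = start
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[start][0]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for idx in range(start, end + 1):
            ranks[idx] = average_rank
        start = end + 1

    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2      # U statistic of the first group
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))    # two-sided tail probability
    return u, p

# Toy attribute values for binding vs. non-binding atoms (made up).
binding = [0.81, 0.77, 0.92, 0.68, 0.88, 0.95, 0.73, 0.84]
non_binding = [0.35, 0.42, 0.51, 0.29, 0.60, 0.47, 0.38, 0.55]
u_stat, p_value = mann_whitney_u(binding, non_binding)
```

In the toy data every binding value exceeds every non-binding value, so U reaches its maximum of n1 × n2 and the p-value is small; two identical samples give a p-value of 1.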

3. Results and discussion

3.1. Evaluation of the protein surface attributes which characterize FMN interaction sites

To understand the contribution of protein surface attributes in distinguishing the FMN-binding sites from non-binding sites, we analyzed protein surface attributes in the training set P81. Fig. 1 shows the Mann–Whitney U-test p-value results (see Section 2) for each attribute type j (x-axis in Fig. 1) on each protein atom type i (y-axis in Fig. 1) calculated with two groups of A_{i,j} (defined in Eq. (2)). One group of A_{i,j} was calculated for the protein surface atoms of type i in the FMN-binding sites in the P81 dataset and the other group of A_{i,j} was calculated for the non-FMN-binding atoms of type i in the same dataset. The y-axis (Fig. 1) is the protein atom types i = 1–30 (atom types 1–30, Table 1); the x-axis is the interacting atom types j = 1–40 (atom types 1–40, Table 1) and the 41st attribute reflecting the local geometry of the protein surface (see Section 2). The p-value of the U-test is color-coded as shown in the figure. The plus (+) sign in a matrix element indicates that the averaged feature value for the FMN-binding atoms is larger than the averaged feature value for non-binding atoms, and the minus (−) sign indicates the reverse. The statistical analysis revealed that the space around protein atom types y = 1–4 (backbone atoms) and 5–9 (aliphatic, hydrophobic, and aromatic carbons) was enriched with higher densities of interacting atom types x = 32–40 from FMN, indicating that the FMN-binding sites are composed of these protein atom types. These protein atom types were also enriched with PDMs from protein backbone interacting atom types (x = 2–4), sidechain carboxyl and carbonyl oxygens (x = 11–12) and aliphatic carbons (x = 5–6) near the above two polar functional groups,

Fig. 1. Mann–Whitney U-test p-values on the 41 attributes for each of the 30 protein atom types. The ranksum function in MATLAB was used for the calculation. Two sets of data were input to the function and the output p-value is the probability for the two distributions of data to be statistically indistinguishable. More details of this figure are described in the associated text.



indicating that these protein atom types in the FMN-binding sites prefer to interact with negatively charged carboxyl groups and partially negatively charged carbonyl groups, similar to the polar oxygens on FMN. This analysis suggested that the attribute sets are statistically significant in differentiating the binding site atoms from non-binding atoms on protein surfaces.

3.2. Performance of the atom-based prediction with machine learning models

Machine learning models for each of the 30 protein atom types were trained and cross-validated with the P81 dataset. The 41 attributes were used as inputs for each of the machine learning models. Further, to estimate the contribution of these attributes to prediction accuracy, we created different subsets of the attributes and evaluated their performance. The subsets include P (attributes 1–30), W (attribute 31), F (attributes 32–40), G (attribute 41), PWF (attributes 1–40) and PWFG (attributes 1–41). Fig. 2A shows that the models trained with these subsets of attributes reached an overall average MCC of 0.28, 0.04, 0.35, 0.11, 0.37 and 0.39, respectively. As expected, the results shown in Fig. 2A indicate that all 41 attributes together as input lead to the best MCC for the predictor of each of the 30 protein atom types. The blue histogram in Fig. 2B indicates that increasing prediction confidence level is correlated with increasing value of the attributes derived from FMN atoms (CR, P, O1O2, OL, NR, OL, O.3, C.3, and N1N9) and protein backbone carbonyl atoms (C and O), consistent with the results shown in Fig. 2A where the attribute subset F contributes the majority of the prediction accuracy. On the other hand, the attributes associated with the positively charged sidechain atoms (NH1S, NC2, and NH3) are negatively correlated with prediction confidence level (Fig. 2B). This again indicates that the FMN-binding protein atoms do not prefer to interact with positively charged atoms, in agreement with the fact that FMN is a negatively charged molecule. The correlation for these attributes versus true binding site (red histogram in Fig. 2B) shows a similar trend to that in the blue histogram, confirming that the attributes (x-axis) with higher correlation coefficient (y-axis) versus true

Table 2
Residue-based FMN-binding site prediction benchmarks for training and independent test sets.

Dataset             Accuracy  Recall  Specificity  Precision  F-score  AUC   MCC
P81                 0.94      0.53    0.97         0.61       0.57     0.90  0.53
P30                 0.94      0.50    0.98         0.64       0.56     0.88  0.53
P15                 0.96      0.64    0.98         0.70       0.67     0.92  0.65
P81_Wang et al. (a) 0.87      0.87    0.87         NA         NA       0.94  NA
P30_Wang et al. (a) 0.88      0.71    0.89         0.35       0.47     NA    0.44

NA: not available.
(a) All values are extracted/calculated from Wang et al. (2012).

Fig. 2. Analysis of the attributes. (A) The x-axis represents the 30 atom types (Table 1) and the y-axis shows the MCC values from five-fold cross-validation of P81. The subsets of attributes are P (protein atom types), W (water), F (FMN atom types), G (geometry), PWF and PWFG. (B) The blue histogram shows the correlations between prediction confidence levels and attributes derived from concentrations of PDMs. Pearson's correlation coefficients, which are the measurements for the linear correlations between the prediction confidence level and the attributes, are shown on the y-axis. The x-axis shows the feature types (Table 1), each of which corresponds to one of the a_{i,j} (Eq. (2)). The red histogram shows Pearson's correlation coefficients between the positive or negative assignments for protein surface atoms and the attribute values for the protein surface atoms. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)




binding site (shown in red histogram) contribute more weight to the prediction accuracy (shown in blue histogram).

Fig. 3. Examples of FMN-binding site predictions on the proteins in the P81 set. The atom colors in the atom-based prediction are based on the prediction confidence level. The colored bar at the bottom of the figure is the color code for the confidence level. The red colored atoms are the seeds for the FMN-binding site patch prediction. In the residue-based predictions, the predicted atoms with greater than 0.5 confidence level are colored in red and those below that level are colored in orange. In the true binding site, red colored atoms are the actual interface atoms. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 4. Residue-based MCC (y-axis) for each of the 20 amino acids (x-axis). The MCCs were calculated from the prediction results of the training set P81 (black) and the test set P30 (gray).

3.3. Performance of the residue-based prediction with machine learning models

The outputs from atom-based machine learning models were converted to confidence values, which were used to predict FMN-binding residues (Chen et al., 2012; Tsai et al., 2012) (for details see Section 2). The residue-based prediction is more accurate (average MCC of 0.53) than the atom-based prediction (average MCC of 0.39) for the P81 dataset (Table 2). Fig. 3 compares, with two examples, the residue-based predictions and atom-based predictions with actual FMN-binding sites. The predictors were further tested with the independent dataset P30, whose proteins were unseen by the trained machine learning models. The prediction of P30 was performed with the models trained on P81. The benchmarks of the test set P30 (Table 2) indicate that the predictors perform equally well in the predictions on new cases. The detailed analysis and interactive visualization for each of the proteins of these datasets are available at http://ismblab.genomics.sinica.edu.tw. Fig. 4 shows the prediction accuracy for each amino acid type on the query protein surfaces. The results show that residues other than Asp and Glu have reasonable prediction accuracy. The reason for the poor prediction rate on these two amino acid types is that they do not appear frequently enough in the FMN-binding sites to provide adequate training of the predictors.
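The residue-based predictions discussed in this section start from the conversion of ANN outputs to confidence levels described in the Methods (bins of interval 0.1, with the confidence of a bin defined as the fraction of true positives among the validation predictions in that bin). That lookup can be sketched as follows. The function names and the toy validation data are hypothetical, and assigning a confidence of 0.0 to an empty bin is an assumption of this sketch rather than something the text specifies.

```python
def build_confidence_table(outputs, labels, n_bins=10):
    """Map each output bin to the fraction of true positives in that bin.

    outputs: ANN output values in [0, 1] on the validation set.
    labels:  1 for a true binding atom, 0 otherwise.
    Assumption: an empty bin is given a confidence of 0.0.
    """
    totals = [0] * n_bins
    positives = [0] * n_bins
    for value, label in zip(outputs, labels):
        b = bin_index(value, n_bins)
        totals[b] += 1
        positives[b] += label
    return [p / t if t else 0.0 for p, t in zip(positives, totals)]

def bin_index(value, n_bins=10):
    """Index of the equal-width bin containing value (1.0 goes in the last bin)."""
    return min(int(value * n_bins), n_bins - 1)

def confidence(value, table):
    """Look up the confidence level for a new ANN output value."""
    return table[bin_index(value, len(table))]

# Toy validation outputs and labels (made up for illustration).
val_outputs = [0.05, 0.12, 0.55, 0.58, 0.91, 0.93, 0.97, 0.99]
val_labels = [0, 0, 1, 0, 1, 1, 0, 1]
table = build_confidence_table(val_outputs, val_labels)
```

With such a table, a raw output of 0.95 would be reported at the empirical precision of its bin instead of at face value, which is what makes the 50% and 30% patch thresholds interpretable as confidence levels.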

3.4. Performance of the model in low-sequence similarity case

To demonstrate our method's ability to predict binding sites on low-sequence-similarity proteins, we created a new dataset (P15), which shares <10% sequence similarity with the P81 and P30 datasets. The models trained on P81 were used to predict the FMN-binding residues on the proteins in the P15 dataset. The benchmarks are shown in Table 2. The performance on all three datasets (Fig. 5) was further evaluated by the receiver operating characteristic (ROC) curve, which indicates that P15 achieves a better AUC (0.92) than P81 (0.90) and P30 (0.88). The results show that our current method performs equally well on novel proteins.

3.5. Comparison with other method

As of now, the only method available for FMN-binding site prediction is that of Wang et al. (2012), and unfortunately no webserver is available for it. Thus, we used their datasets P81 and P30 for training and testing for the purpose of comparison. Table 2 compares the prediction results of the two works. Although the method in this work performed better in accuracy and MCC on P30 (0.94 versus 0.88 and 0.53 versus 0.44, respectively; Table 2), the prediction accuracy has room for further improvement. In the future, enlarging the database with more protein–FMN complexes would improve prediction accuracy.

3.6. FMN binding-site pattern analysis

To understand the FMN binding-site pattern, we analyzed the residue preferences in the binding sites (Fig. 6). This analysis revealed that Gly, Ser, Thr, Val, Tyr, Ala, Arg, Asn, Ile and Leu residues occurred most frequently in the binding sites. Among these residues, Gly, Ser, Thr and Ala are highly preferred in the binding site. Residues Cys and Glu are least preferred in the binding site. Previous analysis of FAD binding sites showed that Gly, Ala, Asp, Glu, Ile, Asn, Arg, Ser, Thr, Val, Leu and Tyr were preferred residues in the binding site (Mishra and Raghava, 2010). Based on this comparison, FMN prefers residues similar to those preferred by FAD; however, the preference level for negatively charged residues is low in the FMN-binding site.

Fig. 5. ROC plot summarizing the performance on the P81, P30 and P15 datasets.

Fig. 6. Percentage occurrence of residues in the FMN-binding sites.

4. Conclusions

In summary, we have developed a structure-based method that employs probability density distributions of the interacting atoms on protein surfaces to predict the FMN-interacting residues on protein structures. The method was trained and tested on the P81, P30 and P15 datasets. The results show that the prediction accuracy of the method in this work is superior to that of the other method, judging by the overall MCC measurement. Moreover, the method is able to successfully predict binding sites on low-sequence-similarity proteins, making this approach uniquely useful for the prediction of FMN-binding sites on protein structures of unknown function and on computational protein models that need to be annotated with predicted structural information.

Acknowledgment

This work was supported by National Science Council (NSC 100IDP006-3 and NSC 99-2311-B-001-014-MY3), and Genomics Research Center at Academia Sinica (AS-100-TP2-B01).

References

Ansari, H.R., Raghava, G.P., 2010. Identification of NAD interacting residues in proteins. BMC Bioinform. 11, 160.
Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., Jain, S., Fagan, P., Marvin, J., Padilla, D., Ravichandran, V., Schneider, B., Thanki, N., Weissig, H., Westbrook, J.D., Zardecki, C., 2002. The protein data bank. Acta Crystallogr. D Biol. Crystallogr. 58, 899–907.
Breiman, L., 1996. Bagging predictors. Mach. Learn. 24, 123–140.
Chen, C.T., Peng, H.P., Jian, J.W., Tsai, K.C., Chang, J.Y., Yang, E.W., Chen, J.B., Ho, S.Y., Hsu, W.L., Yang, A.S., 2012. Protein–protein interaction site predictions with three-dimensional probability distributions of interacting atoms on protein surfaces. PLoS One 7, e37706.
Cremades, N., Bueno, M., Toja, M., Sancho, J., 2005. Towards a new therapeutic target: Helicobacter pylori flavodoxin. Biophys. Chem. 115, 267–276.
Fernandes, C.L., Breda, A., Santos, D.S., Basso, L.A., Souza, O.N., 2007. A structural model for chorismate synthase from Mycobacterium tuberculosis in complex with coenzyme and substrate. Comput. Biol. Med. 37, 149–158.

Laskowski, R.A., Thornton, J.M., Humblet, C., Singh, J., 1996. X-SITE: use of empirically derived atomic packing preferences to identify favourable interaction regions in the binding sites of proteins. J. Mol. Biol. 259, 175–201.
Macheroux, P., Schmid, J., Amrhein, N., Schaller, A., 1999. A unique reaction in a common pathway: mechanism and function of chorismate synthase in the shikimate pathway. Planta 207, 325–334.
Manning, C.D., Raghavan, P., Schutze, H., 2007. An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.
Mansoorabadi, S.O., Thibodeaux, C.J., Liu, H.W., 2007. The diverse roles of flavin coenzymes – nature's most versatile thespians. J. Org. Chem. 72, 6329–6342.
Mishra, N.K., Raghava, G.P., 2010. Prediction of FAD interacting residues in a protein from its primary sequence using evolutionary information. BMC Bioinform. 11 (Suppl. 1), S48.
Serrano, A., Frago, S., Velazquez-Campoy, A., Medina, M., 2012. Role of key residues at the flavin mononucleotide (FMN): adenylyltransferase catalytic site of the bifunctional riboflavin kinase/flavin adenine dinucleotide (FAD) synthetase from Corynebacterium ammoniagenes. Int. J. Mol. Sci. 13, 14492–14517.
Tsai, K.C., Jian, J.W., Yang, E.W., Hsu, P.C., Peng, H.P., Chen, C.T., Chen, J.B., Chang, J.Y., Hsu, W.L., Yang, A.S., 2012. Prediction of carbohydrate binding sites on protein surfaces with 3-dimensional probability density distributions of interacting atoms. PLoS One 7, e40846.
Wang, G., Dunbrack Jr., R.L., 2003. PISCES: a protein sequence culling server. Bioinformatics 19, 1589–1591.
Wang, X., Mi, G., Wang, C., Zhang, Y., Li, J., Guo, Y., Pu, X., Li, M., 2012. Prediction of flavin mono-nucleotide binding sites using modified PSSM profile and ensemble support vector machine. Comput. Biol. Med. 42, 1053–1059.
Yu, C.M., Peng, H.P., Chen, I.C., Lee, Y.C., Chen, J.B., Tsai, K.C., Chen, C.T., Chang, J.Y., Yang, E.W., Hsu, P.C., Jian, J.W., Hsu, H.J., Chang, H.J., Hsu, W.L., Huang, K.F., Ma, A.C., Yang, A.S., 2012. Rationalization and design of the complementarity determining region sequences in an antibody–antigen recognition interface. PLoS One 7, e33340.

