Journal of Theoretical Biology 365 (2015) 32–39


A two-layer classification framework for protein fold recognition
Reza Zohouri Aram, Nasrollah Moghadam Charkari*
Faculty of Electrical & Computer Engineering, University of Tarbiat Modares, Tehran, Iran

HIGHLIGHTS

• An individual method and a fusion method are proposed for protein fold recognition.
• The proposed methods are based on two-layer classification.
• The proposed methods improve the prediction accuracy by 2–10% on a benchmark dataset.

ARTICLE INFO

Article history:
Received 2 June 2014
Received in revised form 9 September 2014
Accepted 19 September 2014
Available online 30 September 2014

Keywords: Supervised learning; Ensemble classifiers; Fusion system

ABSTRACT

Protein fold recognition is an important problem in bioinformatics, aimed at predicting the tertiary structure of proteins. In this paper, an individual method and a fusion method are proposed for protein fold recognition. A Two-Layer Classification Framework (TLCF) is proposed as the individual method. This framework comprises two layers: in the first layer, the structural class of a protein is predicted. The classifier in this layer assigns instances to one of four structural classes: all alpha, all beta, alpha/beta, and alpha+beta. The classification result is then added as a new feature to the training and testing datasets. Using the results of the first layer, another classifier predicts the 27 folding classes in the second layer. The results indicate that the proposed approach is very effective in improving prediction accuracy, with promising values of MCC, specificity, and sensitivity. TLCF* is proposed as a fusion method that exploits TLCF as a base model. The experimental results indicate that the proposed methods improve prediction accuracy by 2–10% on a benchmark dataset.
© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

The fold recognition problem is one of the fundamental problems in molecular biology. It is defined as obtaining the three-dimensional (3D) structure of proteins from their sequences without relying on sequence similarity (Ding and Dubchak, 2001). Identification of protein tertiary structure is of great importance, since the function of a protein is largely determined by its tertiary structure (Shenoy and Jayaram, 2010). Moreover, it plays an essential role in the design of new drugs and therapies. Nowadays, there is an immense gap between the number of known protein sequences and the number of experimentally confirmed protein tertiary structures (Lee et al., 2009). Thus, efficient computational methods for predicting 3D structures from sequences may be considered a way to address this problem. Computational methods

* Corresponding author. Tel.: +98 2182883301; fax: +98 2182884325. E-mail addresses: [email protected] (R.Z. Aram), [email protected] (N.M. Charkari).
http://dx.doi.org/10.1016/j.jtbi.2014.09.032
0022-5193/© 2014 Elsevier Ltd. All rights reserved.

have been used for predicting 3D structures for more than four decades. There are two popular classes of computational methods for predicting tertiary structure: (a) template-based methods (TBM) and (b) ab initio methods (Lee et al., 2009). To identify the tertiary structure of a given sequence, template-based methods use known three-dimensional structures in the Protein Data Bank (PDB) (Kouranov et al., 2006) as templates (Lee et al., 2009). Ab initio methods, on the other hand, do not use any templates but build the 3D models from scratch (Lee et al., 2009). Ab initio modeling predicts protein structures using either physical and chemical principles or other techniques (Dong et al., 2007); however, it has high computational complexity. Machine learning methods are among the most common template-based approaches for predicting 3D structure, and they are the focus of this paper. Two methods are proposed here for protein fold recognition: an individual method and a fusion method. A Two-Layer Classification Framework (TLCF) is proposed as the individual method. This framework is composed of two layers. In the first layer, we attempt to predict the structural class of a protein. The


classifier in the first layer classifies the instances into four structural classes: all alpha, all beta, alpha/beta, and alpha+beta. Then, we add the classification results of the first layer as a new feature to the training and testing datasets. Using the results of the first layer, another classifier is employed to predict the 27 folding classes in the second layer. To improve the prediction accuracy further, TLCF* is proposed by introducing a novel fusion system. Generally, a fusion system is a combination of individual classifiers that operates on the classifiers' outputs; the outputs of all individual classifiers are combined using techniques such as voting rules. As discussed in a comprehensive review article (Chou, 2011) and followed up by a series of recent publications (Liu et al., 2014; Qiu et al., 2014; Guo et al., 2014; Ding et al., 2014; Xu et al., 2014), in order to establish a really useful statistical predictor for a biological system, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the biological samples with an effective mathematical expression that truly reflects their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform an evaluation method to objectively assess the anticipated accuracy of the predictor; (v) establish a user-friendly web server for the predictor that is accessible to the public. Below, after introducing related work, we describe how we deal with these steps one by one.

2. Related work

Support vector machines (SVMs) and neural networks (NNs) are two popular methods for protein fold recognition. Ding and Dubchak (2001) proposed the unique one-versus-others (uOvO) and all-versus-all (AvA) methods, employing SVMs and three-layer feedforward NNs as base classifiers. Yang et al. (2008) applied three types of classifiers: k-nearest neighbors, class center and nearest neighbor, and probabilistic neural networks; the results of these classifiers were then combined using an ensemble voting system. Ensemble classifiers are frequently used in protein fold recognition. An ensemble method is a supervised learning algorithm that uses multiple classifiers to obtain better prediction accuracy. Guo and Gao (2008) presented a two-layer ensemble classifier. In the first layer, a potential class index for every query protein over the 27 folds is identified. According to this result, a 27-dimensional vector is generated in the second layer. Finally, a genetic algorithm is adopted to obtain weights for the outputs of the second layer to produce the final result. Kavousi et al. (2012) proposed a method in which an unknown query protein is assigned to a hyperfold rather than a single fold; each hyperfold is a set of interlaced folds with a centroid fold, and Dempster's rule is used to combine the results. Nanni (2006a, 2006b) proposed ensembles of classifiers and applied them to protein fold recognition. Another ensemble method was applied by Hashemi et al. (2009) to protein fold pattern recognition; they improved the prediction accuracy using a Bayesian ensemble of RBFNs, in which the normalized confusion matrix of each base classifier (i.e., RBFN) is used to combine the outputs. Chmielnicki and Stapor (2012) suggested a hybrid discriminative/generative approach, utilizing RDA (as a generative classifier) and SVM (as a discriminative classifier): using the results of RDA, the SVM classifies the proteins. Abbasi et al. (2013) made use of an intelligent hyper framework whose components are used to classify proteins under fuzzy conditions. A novel approach named PFP-FunDSeqE was proposed by Shen and Chou (2009), in which the functional domain information and the sequential evolution


information of proteins are combined through a fusion ensemble classifier. PFP-FunDSeqE improved the prediction rate by fusing the five features extracted by Ding and Dubchak with four pseudo amino acid composition features introduced by Chou (2005). Shen and Chou (2006) proposed a different ensemble classifier named PFP-Pred, which uses the evidence-theoretic k-nearest neighbor (ET-KNN) as the base classifier. ET-KNN is a classification method based on Dempster–Shafer theory (see Shen and Chou (2005) for more details). ET-KNN was run separately on nine feature sets, and the nine resulting outputs were combined using weighted voting. Jazebi et al. (2009) employed a fusion method for fold pattern recognition, using the probabilistic neural network (PNN) as the base classifier; the fusion method combined the classification results on six different feature sets using a weighted voting approach. Leon et al. (2009) presented a taxonometric approach based on different classification techniques such as k-nearest neighbor (K-NN), decision trees, Naive Bayes, and neural networks (NNs). In their experiments, they found that the neural network and K-NN perform better than the other techniques. A multi-objective feature analysis algorithm was proposed by Shi et al. (2004). Its objective is to simultaneously select effective features, improve the accuracy, and provide bias information about the test and training data. To this end, the authors employed an extended wrapper method for feature selection and used an SVM for the classification task; however, the method suffers from high time complexity. Huang et al. (2003) proposed a Hierarchical Learning Architecture (HLA) that works on two levels. In the first level, the four structural classes (all alpha, all beta, alpha/beta, and alpha+beta) are predicted, while in the next level protein features are classified into 27 folds. The main weakness of HLA is that if the classifier in level 1 makes a mistake, the classifiers in level 2 are not able to recover from it. The method proposed in this paper uses two levels of classification like HLA (Huang et al., 2003); however, it differs in several aspects, which are discussed in the relevant sections.

3. The dataset and feature vectors

3.1. Training and test datasets

To compare our method with previous works, the dataset introduced in Ding and Dubchak (2001) has been used. It contains 313 instances in the training set and 385 instances in the testing set. In the training set, no two proteins share more than 35% sequence identity for aligned subsequences longer than 80 residues. The testing set of 385 proteins is composed of protein sequences with less than 40% identity to each other. These datasets contain the 27 most populated folds, each represented by seven or more proteins and corresponding to four major structural classes: α, β, α/β, and α+β. The folds and the corresponding numbers of proteins in the two datasets are shown in Table 1.

3.2. Feature vectors

Ding and Dubchak represented the samples based on primary protein sequences. Six features were extracted independently from the protein sequences: amino acid composition (C), predicted secondary structure (S), hydrophobicity (H), normalized van der Waals volume (V), polarity (P), and polarizability (Z). C is the sequence composition of the 20 types of amino acids (see Ding and Dubchak (2001) for more details). C has a dimensionality of 20 and the remaining features are 21-dimensional; hence, each protein can be represented by a vector of 125 features. The six features and their combinations are shown in Table 2.



Table 1
Protein folds and structural classes in the Ding and Dubchak dataset.

Fold | Index | Number in train set | Number in test set
α
Globin-like | 1 | 13 | 6
Cytochrome c | 3 | 7 | 9
DNA-binding 3-helical bundle | 4 | 12 | 20
4-helical up-and-down bundle | 7 | 7 | 8
4-helical cytokines | 9 | 9 | 9
Alpha; EF-hand | 11 | 7 | 9
β
Immunoglobulin-like β-sandwich | 20 | 30 | 44
Cupredoxins | 23 | 9 | 12
Viral coat and capsid proteins | 26 | 16 | 13
ConA-like lectins/glucanases | 30 | 7 | 6
SH3-like barrel | 31 | 8 | 8
OB-fold | 32 | 13 | 19
Trefoil | 33 | 8 | 4
Trypsin-like serine proteases | 35 | 9 | 4
Lipocalins | 39 | 9 | 7
α/β
(TIM)-barrel | 46 | 29 | 48
FAD (also NAD)-binding motif | 47 | 11 | 12
Flavodoxin-like | 48 | 11 | 13
NAD(P)-binding Rossmann-fold | 51 | 13 | 27
P-loop containing nucleotide | 54 | 10 | 12
Thioredoxin-like | 57 | 9 | 8
Ribonuclease H-like motif | 59 | 10 | 14
Hydrolases | 62 | 11 | 7
Periplasmic binding protein-like | 69 | 11 | 4
α+β
β-grasp | 72 | 7 | 8
Ferredoxin-like | 87 | 13 | 27
Small inhibitors, toxins, lectins | 110 | 14 | 27
Total | | 313 | 385

Table 2
Protein fold features in the Ding and Dubchak dataset.

Symbol | Feature | Dimension
C | Amino acid composition | 20
S | Predicted secondary structure | 21
H | Hydrophobicity | 21
P | Polarity | 21
V | Normalized van der Waals volume | 21
Z | Polarizability | 21
CS | Combination of C & S | 41
CSH | Combination of C, S & H | 62
CSHP | Combination of C, S, H & P | 83
CSHPV | Combination of C, S, H, P & V | 104
CSHPVZ | Combination of C, S, H, P, V & Z | 125

4. Supervised learning

Supervised learning is a machine learning approach in which a model is learned from labeled data (Mohri et al., 2012). A supervised learner analyzes the training data and produces a model that can be used to predict the class of test samples (unknown examples). In the following, the supervised algorithms used in our proposed method are discussed.

4.1. Neural networks

Neural networks (NNs) are popular supervised algorithms that consist of an interconnected group of neurons (Cheng and Titterington, 1994). These neurons consist of sets of adaptive weights and work together to approximate a target function. NNs are adaptive systems that change their structure based on the information that flows through them during the learning phase. Tolerance to noise is one of the advantages of neural networks. The multilayer perceptron (MLP) and the radial basis function network (RBFN) are two types of neural networks, described in the following sections.

4.1.1. Multilayer perceptron
A multilayer perceptron (MLP) is a feed-forward neural network (Gardner and Dorling, 1998). In a feed-forward neural network, information moves in only one direction and there are no loops in the network. An MLP consists of multiple layers of nodes connected as shown in Fig. 1; each circle in Fig. 1 is a neuron with a nonlinear activation function. The MLP is trained with the backpropagation algorithm, whose objective is to find the combination of weights that results in the smallest error (Gardner and Dorling, 1998).

Fig. 1. A graphical representation of an MLP with two hidden layers and an output layer.

4.1.2. Radial basis function network
The RBFN implements a Gaussian radial basis function network (Moody and Darken, 1989). It uses k-means clustering to position the basis functions, and symmetric multivariate Gaussians are fit to the data of each cluster. RBF networks have the advantages of strong noise tolerance and good generalization (Hao et al., 2011).

4.2. Rotation forest
Rotation Forest is an ensemble classifier based on feature extraction. The feature set of the dataset is randomly split into K subsets (Rodriguez et al., 2006). Principal component analysis is then applied to each subset, and a base classifier is trained on each rotated dataset. All principal components are retained in order to preserve the variability information in the data (Rodriguez et al., 2006). All classifiers work in parallel, and the results of the base classifiers are combined using majority voting. The main advantage of Rotation Forest is that it encourages both individual accuracy and diversity within the ensemble.
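To make the Rotation Forest procedure above concrete, the following is a minimal sketch of the idea in Python with scikit-learn. The class name, parameters, and the use of a simple bootstrap sample for each PCA are our own illustrative simplifications, not the authors' code (the experiments in this paper were run with WEKA's implementation).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

class SimpleRotationForest:
    def __init__(self, n_trees=10, n_subsets=3, random_state=0):
        self.n_trees = n_trees
        self.n_subsets = n_subsets
        self.rng = np.random.RandomState(random_state)
        self.models = []  # list of (rotation matrix, fitted tree) pairs

    def _rotation_matrix(self, X):
        n_feat = X.shape[1]
        perm = self.rng.permutation(n_feat)  # random split of the feature set
        R = np.zeros((n_feat, n_feat))
        for idx in np.array_split(perm, self.n_subsets):
            boot = self.rng.choice(len(X), size=len(X), replace=True)
            # PCA per feature subset; all components are retained to
            # preserve the variability information in the data
            pca = PCA(n_components=len(idx)).fit(X[np.ix_(boot, idx)])
            R[np.ix_(idx, idx)] = pca.components_.T
        return R

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y, int)
        for _ in range(self.n_trees):
            R = self._rotation_matrix(X)
            tree = DecisionTreeClassifier(random_state=0).fit(X @ R, y)
            self.models.append((R, tree))
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        votes = np.stack([tree.predict(X @ R) for R, tree in self.models])
        # combine the parallel base classifiers by majority voting
        return np.array([np.bincount(v.astype(int)).argmax() for v in votes.T])
```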

5. Proposed methods

5.1. Two-layer classification framework

As can be seen in Table 1, the folds in the dataset are separated into four structural classes: α, β, α/β, and α+β, and each structural class is further divided into several folds. According to these dataset characteristics, we propose a two-layer method for protein fold recognition. The Two-Layer Classification Framework (TLCF) consists of two layers. In the first layer, the structural class of a protein is predicted. Accordingly,



regardless of the 27-fold pattern, the classifiers in the first layer of TLCF classify the instances into four structural classes: all alpha (class #1), all beta (class #2), alpha/beta (class #3), and alpha+beta (class #4). Three types of classifiers are used in this layer: Rotation Forest, an ordinary MLP, and an SVM (with a polynomial kernel). These classifiers work independently and predict the structural class of each instance. Finally, weighted majority voting is employed to combine the results of the classifiers. To predict an unknown example x, majority voting assigns x to the class most represented among the classifier outputs (Kuncheva, 2003). Suppose the output of classifier i is a c-dimensional binary vector $(d_{i,1}, d_{i,2}, \dots, d_{i,c}) \in \{0,1\}^c$, where $i = 1, 2, \dots, L$, c is the number of classes, and L is the number of classifiers. Let $d_{i,j} = 1$ if classifier i assigns x to class j, and $d_{i,j} = 0$ otherwise. The majority voting mechanism picks class $w_k$ if

$$\sum_{i=1}^{L} d_{i,k} = \max_{j=1,\dots,c} \sum_{i=1}^{L} d_{i,j} \qquad (1)$$

If the classifiers are not identical in terms of accuracy, the weighted majority voting rule gives the more competent classifiers more influence on the final decision. Let the weight of classifier i be $b_i$, $i = 1, 2, \dots, L$. In this case, the weighted majority vote picks class $w_k$ if

$$\sum_{i=1}^{L} b_i d_{i,k} = \max_{j=1,\dots,c} \sum_{i=1}^{L} b_i d_{i,j} \qquad (2)$$
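As an illustration of Eqs. (1) and (2), the following is a minimal sketch of the weighted majority vote in Python; the classifier outputs and confidence values in the example are hypothetical.

```python
import numpy as np

def weighted_majority_vote(predictions, weights, n_classes):
    """Eq. (2): pick the class k maximizing sum_i b_i * d_ik, where d_ij = 1
    only for the class predicted by classifier i. With equal weights this
    reduces to the plain majority vote of Eq. (1)."""
    scores = np.zeros(n_classes)
    for pred, b in zip(predictions, weights):
        scores[pred] += b
    return int(np.argmax(scores))

# Hypothetical example: Rotation Forest, SVM, and MLP predict structural
# classes 2, 1, 1 with confidence values 0.9, 0.7, 0.6 respectively.
print(weighted_majority_vote([2, 1, 1], [0.9, 0.7, 0.6], n_classes=4))  # -> 1
```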

In our study, $b_i$ represents the confidence value of each classifier when predicting the structural class of an unknown sample; the confidence value of a class states how certain it is that a sample belongs to that class. The output of the first layer is the predicted structural class for each sample. It is added as a new feature to the training and testing datasets used in the second layer. For example, suppose that the final output of the ensemble system is "all alpha" for sample #1; then a new feature with the value "all alpha" is added for sample #1. This procedure is performed for all training and testing samples, so a new feature named "structural class" is added to the fold features (C, S, H, P, V, and Z). In the second layer, another classifier is employed, which further classifies the instances into 27 folds. TLCF is not restricted to a specific classifier in the second layer, and we have used various classifiers there in our experiments. With this added feature, the experimental results (Section 6) are very encouraging, as the learners in the second layer take the structural class of each instance into account. The two-layer classification framework is shown in Fig. 2.
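A minimal sketch of this feature-augmentation step might look as follows; the function and variable names are our own illustration (the paper's experiments were run in WEKA), assuming the structural classes are encoded as integers 0 to 3.

```python
import numpy as np

def add_structural_class_feature(X, predicted_classes):
    """Append the first-layer output as one extra column, turning the
    fold features (C, S, H, P, V, Z) into the second-layer input."""
    return np.hstack([X, np.asarray(predicted_classes).reshape(-1, 1)])

# Hypothetical usage:
#   X_train_aug = add_structural_class_feature(X_train, first_layer.predict(X_train))
#   X_test_aug  = add_structural_class_feature(X_test,  first_layer.predict(X_test))
#   second_layer.fit(X_train_aug, y_folds_train)   # 27-fold classification
```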


The input of the first layer is the set of 38 selected features mentioned in Hor et al. (2009); the selected features used in TLCF are listed in Table 3. As mentioned, in HLA (Huang et al., 2003), instances that are incorrectly classified in the first level cannot be recovered in the second level. In TLCF, even if the value of the "structural class" feature is wrongly labeled for some instances by the first layer, it will be shown that the positive effect of the other features leads to a correct fold prediction for some of them. This is one advantage of TLCF over HLA.

5.2. TLCF*

In this section, TLCF* is presented by introducing a new fusion system. TLCF* improves the prediction accuracy by exploiting both the TLCF and HLA methods. A schematic of TLCF* is shown in Fig. 3. It contains three learning models that work in parallel. The first model is identical to TLCF, as shown in Fig. 2. The second model is similar to HLA (Huang et al., 2003), but with an additional step. In the third model, TLCF* classifies the instances into the 27 folds directly, without using two-layer classification. For a given unknown example, each of the models produces an output; finally, the weighted majority voting rule is employed to combine the results. To clarify the mechanism of the second model, we describe it in the following. In HLA, the data are categorized into four parts in the second layer. In each part, proteins belong to only one of the structural classes, i.e., all alpha, all beta, alpha/beta, or alpha+beta. In other words, to predict the fold, samples are sent to one of the parts according to the structural class predicted in the first layer. As mentioned, HLA is not able to correctly predict the fold of instances that are incorrectly classified in the first layer. To address this problem, we apply an extra step to HLA; pseudo-code for this step is shown in Fig. 4. In this procedure, after predicting the structural class of an unknown sample X in the first layer, the nearest neighbor (based on the Manhattan distance) of X is obtained from each categorized dataset, i.e., all alpha (Γ1), all beta (Γ2), alpha/beta (Γ3), and alpha+beta (Γ4). The Manhattan distance between X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) is

$$d_M(X, Y) = \sum_{i=1}^{n} |x_i - y_i| \qquad (3)$$

In fact, to predict the fold, the unknown sample X is given to each of the models learned on Γ1 to Γ4. Then, based on the Manhattan distances, a confidence value (dc) is assigned to the final outputs of the second model of TLCF* for sample X.

Fig. 2. Proposed TLCF for protein fold recognition. In the first layer, Rotation Forest, SVM, and MLP predictions are combined by voting and the result is added as a new feature; in the second layer, a classifier assigns folds 1–27.

Table 3
Selected features.

First layer: C2, C3, C5, C6, C8, C9, C14, C15, C18, S1, S2, S3, S6, S7, S9, S12, S16, H3, H4, H5, H7, H8, H9, H13, H14, H16, H17, H18, H19, V1, V4, V9, V13, P2, P7, P19, Z5, Z18
Second layer: Original features



Fig. 3. Proposed TLCF* for protein fold recognition. Three models work in parallel: (i) TLCF (Rotation Forest on folds 1–27 with the added structural-class feature); (ii) an HLA-like model in which, after predicting the structural class in the first layer, separate Rotation Forests classify the data of α (folds 1–6), β (folds 7–15), α/β (folds 16–24), and α+β (folds 25–27), with a confidence value assigned based on the Manhattan distance; and (iii) a Rotation Forest classifying folds 1–27 directly. The outputs are combined by a weighted majority vote.

Input: training sets Γ1, Γ2, Γ3, Γ4; unknown sample X
For i = 1:4
    Yi = nearest neighbor of X in Γi
    di = dM(X, Yi)
    if di == 0 then label(X) = label(Yi); exit   % all training data will be correctly classified
End
For i = 1:4
    Adapt di into the interval [0, 1] (the adapted di is the confidence value dc)
End

Fig. 4. Confidence value for unknown samples.
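The extracted pseudo-code above lost some symbols, so the following Python sketch is one possible reading of it: d_i is the Manhattan distance from X to its nearest neighbour in each category Γ_i, a zero distance means X coincides with a training sample, and the rescaling of d_i into [0, 1] to obtain d_c is our own assumption, since the exact adaptation rule is not recoverable from the text.

```python
import numpy as np

def manhattan(x, y):
    return float(np.sum(np.abs(x - y)))        # d_M of Eq. (3)

def confidence_values(X, categories):
    """categories: the four training subsets Gamma_1..Gamma_4 (all alpha,
    all beta, alpha/beta, alpha+beta), each an array of feature vectors.
    Returns (category index, None) if X matches a training sample exactly,
    otherwise (None, one confidence value per category)."""
    d = np.array([min(manhattan(X, y) for y in cat) for cat in categories])
    if np.any(d == 0):
        i = int(np.argmin(d))
        return i, None   # in Fig. 4, X would take the label of that neighbour
    dc = 1.0 - d / d.max()   # assumed rescaling into [0, 1]: closer -> higher
    return None, dc
```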

The above procedure is repeated for all samples, and a confidence value in the interval [0, 1] is generated for each sample. Note that dc is generated only in the second model of TLCF*; the confidence values of the first and third models of TLCF* are generated by the Rotation Forest itself. Once the three models have individually generated their outputs, the weighted majority vote of Eq. (2) is employed to combine them and take the final decision. The experimental results (Section 6) show that the proposed method performs better than HLA (Huang et al., 2003), because the samples that are incorrectly classified in the first layer are handled in both TLCF and TLCF*.

6. Experimental results

We have organized our experiments in two parts. In Section 6.1, we examine the performance of the proposed TLCF. In Section 6.2, we compare the proposed methods with other related studies. As evaluation criteria, we have used the standard accuracy (Q), sensitivity, specificity, and MCC (Matthews

correlation coefficient). In the majority of studies related to fold recognition, however, only accuracy is used for evaluation. We define each of the evaluation criteria in the following. Sensitivity is the ratio of correctly classified samples to the total number of samples in each class (Lyons et al., 2014):

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (4)$$

Here TP (true positives) is the number of samples of a specific class that are correctly identified, and FN (false negatives) is the number of samples of the same class that are not correctly identified. Specificity is the ratio of correctly rejected samples to the total number of rejected test samples (Lyons et al., 2014):

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (5)$$

where TN and FP represent the true negatives and false positives, respectively. The final criterion used in this study is the Matthews correlation coefficient (MCC), which is usually used as a measure of the quality of binary (two-class) classifications (Chen et al., 2013):

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (6)$$

Suppose that C represents the total number of test samples that are correctly classified and N is the total number of test samples. The standard accuracy (Q) is defined by Eq. (7):

$$Q = \frac{C}{N} \times 100 = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \qquad (7)$$
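A minimal sketch computing Eqs. (4)–(7) from per-class counts (one-vs-rest in the multi-class setting); the small check at the end reproduces the first-layer accuracy reported just below.

```python
import math

def sensitivity(tp, fn):
    return tp / (tp + fn)                                   # Eq. (4)

def specificity(tn, fp):
    return tn / (tn + fp)                                   # Eq. (5)

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0    # Eq. (6)

def standard_accuracy(n_correct, n_total):
    return 100.0 * n_correct / n_total                      # Eq. (7)

# 331 of 385 first-layer test samples correctly classified -> about 86%
print(round(standard_accuracy(331, 385), 2))                # 85.97
```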

In the first layer of TLCF, the total number of testing samples that are correctly classified is 331; hence, the first-layer prediction accuracy on the test samples is about 86%.


In data mining, the following methods are often used to evaluate classification algorithms: the independent dataset test, random subsampling, and the jackknife test (Chou and Zhang, 1995). The jackknife test always yields a unique result for a given dataset, as discussed in a series of recent publications (see, e.g., Chou, 2011; Liu et al., 2014; Chen et al., 2014). However, since most related works on the Ding and Dubchak dataset have used only the independent testing dataset for evaluation, the independent testing dataset (described in Section 3.1) has been used to evaluate the proposed methods in this study. Many other studies, such as Lyons et al. (2014), Shen and Chou (2009), and Sharma et al. (2013), have used additional feature extraction techniques and performed the learning process on more effective features. Thus, the comparison is done only with studies that used the feature sets shown in Table 2.

6.1. Performance of TLCF

We have used each of MLP, KNN, RBFN, and Naïve Bayes to classify the proteins into 27 folds, once directly and once with the TLCF method, making it possible to show how adding the new feature (structural class) improves the final results. Table 4 presents the experimental results, which were obtained with the data mining toolkit WEKA, version 3.6.10. In Table 4, each of the classifiers is first applied separately to the original features and the prediction accuracy is obtained.

Table 4
Prediction accuracy (%) of TLCF with different classifiers.

Features | MLP | KNN (K=1) | Naïve Bayes | RBFN (#clusters=27)
C | 43.63 | 45.45 | 44.67 | 47.80
C + "structural class" | 49.09 | 47.27 | 48.31 | 49.87
S | 34.28 | 37.40 | 32.20 | 37.40
S + "structural class" | 38.18 | 40.51 | 34.80 | 40.78
H | 30.64 | 35.32 | 28.83 | 35.32
H + "structural class" | 42.60 | 41.04 | 38.44 | 42.86
P | 29.87 | 33.00 | 29.87 | 33.24
P + "structural class" | 36.88 | 38.44 | 38.44 | 42.60
V | 29.10 | 34.80 | 27.80 | 33.76
V + "structural class" | 41.30 | 40.26 | 37.66 | 41.30
Z | 28.57 | 32.72 | 27.27 | 34.02
Z + "structural class" | 37.14 | 37.66 | 33.76 | 42.08


Then, the new feature "structural class" is added to the original features and the same process is applied. As can be seen in Table 4, the results are significantly improved by adding the "structural class" feature; the results are very encouraging for all six feature sets. It should be mentioned that the value of "structural class" is incorrect for some samples (those incorrectly classified in the first layer). Therefore, improving the accuracy in the first layer increases the overall accuracy.

6.2. Comparison with other methods

In this experiment, we have used the Rotation Forest classifier in the second layer of TLCF to classify the proteins into 27 folds, applying TLCF to the features C, S, H, P, V, and Z. The results are shown in Table 5; sensitivity, specificity, and MCC are reported in Fig. 5. As can be seen from Table 5, we obtained a better accuracy rate than the other studies for all feature sets. In Fig. 5, the average sensitivity, specificity, and MCC over the features C, S, H, P, V, and Z are presented; the three measures are computed for each class, and the average values are then computed individually. As can be seen from Fig. 5, specificity is about 0.98 for all feature sets. For feature C, the sensitivity and MCC values are larger than for the other features; however, sensitivity and MCC exceed 0.40 for all feature sets. MCC is a balanced measure that can be used when the classes are of very different sizes. MCC returns a value in the interval [−1, +1]: a value of +1 indicates a perfect prediction, −1 indicates total disagreement between predicted and actual classes, and an MCC close to zero means that the predictor behaves almost randomly. With these considerations, the MCC results obtained by our models are satisfying for this study. As mentioned in Section 5.2, we use an ensemble (fusion) of Rotation Forests in TLCF* to classify the proteins into 27 folds; thus, we compare TLCF* with ensemble and hybrid methods. The results are presented in Table 6. The best prediction accuracy of TLCF* is 65.71%, which is better than the other methods; this accuracy was obtained by an ensemble over all feature sets. TLCF* contains three models, each of which makes some errors in predicting the folds; TLCF* reduces these errors by combining the three mentioned models.

Table 5
Comparison between TLCF and other related papers in terms of prediction accuracy (%).

Method | C | S | H | P | V | Z
uOvO SVM (Ding and Dubchak, 2001) | 49.40 | – | – | – | – | –
AvA SVM (Ding and Dubchak, 2001) | 44.90 | – | – | – | – | –
CCNN^a (Yang et al., 2008) | 42.08 | 35.84 | 32.21 | 27.01 | 33.77 | 29.87
HLA^b-RBFN (Huang et al., 2003) | 44.90 | – | – | – | – | –
MOF^c (Shi et al., 2004) | 44.50 | 38.50 | 37.50 | 35.50 | 36.50 | 35.00
Fusion method (Jazebi et al., 2009) | 49.10 | 39.70 | 35.50 | 36.80 | 36.00 | 33.70
Hyperfold approach (Kavousi et al., 2012) | – | 44.10 | 36.03 | 35.51 | 31.59 | 27.15
Taxonometric approach (Leon et al., 2009) | 45.71 | 39.22 | 35.32 | 33.77 | 34.81 | 32.73
FRAN^d (Abbasi et al., 2013) | 51.43 | 34.55 | 40.00 | 38.70 | 41.04 | 38.96
TLCF^e (this paper) | 52.21 | 48.05 | 49.09 | 42.86 | 46.50 | 44.16
TLCF^f (this paper) | 58.18 | 50.90 | 51.42 | 51.42 | 51.42 | 51.16

^a Class center and nearest neighbor.
^b Hierarchical learning architecture.
^c Multi-objective feature analysis.
^d Fuzzy resource-allocating network.
^e Rotation Forest with 10 iterations in the second layer.
^f Rotation Forest with 100 iterations in the second layer.




Fig. 5. Sensitivity, Specificity and MCC of all feature sets.

Table 6
Comparison between TLCF* and other related papers in terms of prediction accuracy (%).

Method | Best result
Ensemble of RBFN (Hashemi et al., 2009) | 46.24
Ensemble of OET-KNN^a (Shen and Chou, 2006) | 62.10
Ensemble of RS^b and HKNN (Nanni, 2006a,b) | 60.30
Ensemble of HKNN (Nanni, 2006a,b) | 61.10
Ensemble of KNN, CCNN, and NNs (Yang et al., 2008) | 63.12
Ensemble of KNN (Guo and Gao, 2008)^c | 63.70
Hybrid SVM-RDA (Chmielnicki and Stapor, 2012) | 62.60
TLCF* - ensemble of Rotation Forest (this paper) | 65.71

^a Optimized evidence-theoretic KNN.
^b Random subspace method.
^c Majority voting is used as the ensemble strategy.

It is worth mentioning that although TLCF* improves the prediction rate by about 2% over the other related methods, it is an extremely difficult task to enhance the success rate even by 1% or 2%, as discussed in other related works (Shen and Chou, 2009). The low number of training samples and the large number of classes make it difficult to learn a good model for protein fold recognition.

7. Conclusion

In this paper, we present a framework for protein fold recognition based on a two-layer structure. Unlike many other studies, the framework is not restricted to a specific classifier; by applying stronger classifiers within the framework, the prediction accuracy can be improved. For example, using an SVM in the second layer and finding appropriate parameters for the RBF kernel may also be effective in improving performance. The use of other ensemble classifiers such as AdaBoost, which are suitable for high-dimensional problems, can also be considered in future work to improve the classification rate. We have used a divide-and-conquer approach for protein fold recognition, since a four-class problem is solved first instead of solving a 27-class problem directly. The prediction accuracy in the second layer strongly depends on the results of the first layer; thus, by improving the accuracy of the first layer, the prediction accuracy of the second layer will be increased. The experimental results indicate that the proposed framework is very effective for protein fold recognition: the methods improve the prediction accuracy by 2–10% on the Ding and Dubchak datasets. With the TLCF method, accuracy is higher than in other studies for all feature sets, and the best prediction accuracy, 65.71%, was obtained by an ensemble of all feature sets in TLCF*. However, the lack of an appropriate feature selection mechanism is a drawback of the proposed methods; by selecting appropriate features and eliminating redundant ones, both the runtime of the algorithm and the prediction accuracy could be improved. Since user-friendly and publicly accessible web servers represent the future direction for developing practically more useful prediction methods or models (Chou and Shen, 2009; Lin and Lapointe, 2013), we plan to provide a web server for our method in future work, as done in Liu et al. (2014), Qiu et al. (2014), Guo et al. (2014), Ding et al. (2014), and Xu et al. (2014).

References

Abbasi, E., Ghatee, M., Shiri, M.E., 2013. FRAN and RBF-PSO as two components of a hyper framework to recognize protein folds. Comput. Biol. Med. 43 (9), 1182–1191.
Chen, W., et al., 2013. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res., gks1450.
Chen, W., et al., 2014. iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition. BioMed Res. Int. 2014.
Cheng, B., Titterington, D.M., 1994. Neural networks: a review from a statistical perspective. Stat. Sci. 9 (1), 2–30.
Chmielnicki, W., Stapor, K., 2012. A hybrid discriminative/generative approach to protein fold recognition. Neurocomputing 75 (1), 194–198.
Chou, K.-C., 2005. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21 (1), 10–19.
Chou, K.-C., 2011. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 273 (1), 236–247.
Chou, K.-C., Shen, H.-B., 2009. Review: recent advances in developing web-servers for predicting protein attributes. Nat. Sci. 1 (2), 63.
Chou, K.-C., Zhang, C.-T., 1995. Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol. 30 (4), 275–349.
Ding, C.H.Q., Dubchak, I., 2001. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 17 (4), 349–358.
Ding, H., et al., 2014. iCTX-Type: a sequence-based predictor for identifying the types of conotoxins in targeting ion channels. BioMed Res. Int. 2014.
Dong, Q.-W., Wang, X.-L., Lin, L., 2007. Methods for optimizing the structure alphabet sequences of proteins. Comput. Biol. Med. 37 (11), 1610–1616.
Gardner, M.W., Dorling, S.R., 1998. Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmos. Environ. 32 (14–15), 2627–2636.
Guo, S.-H., et al., 2014. iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics, btu083.
Guo, X., Gao, X., 2008. A novel hierarchical ensemble classifier for protein fold recognition. Protein Eng. Des. Sel. 21 (11), 659–664.
Hao, Y., et al., 2011. Advantages of radial basis function networks for dynamic system design. IEEE Trans. Ind. Electron. 58 (12), 5438–5450.
Hashemi, H.B., Shakery, A., Naeini, M.P., 2009. Protein fold pattern recognition using Bayesian ensemble of RBF neural networks. In: Proceedings of the International Conference of Soft Computing and Pattern Recognition (SoCPaR'09).
Hor, C.-Y., Shiau, S.-H., Yang, C.-B., 2009. Feature selection and combination methods for protein fold classification. In: Proceedings of the 14th Conference on Artificial Intelligence and Applications, Taichung, Taiwan.
Huang, C.D., Lin, C.T., Pal, N.R., 2003. Hierarchical learning architecture with automatic feature selection for multiclass protein fold classification. IEEE Trans. Nanobiosci. 2 (4), 221–232.
Jazebi, S., Tohidi, A., Rahgozar, M., 2009. Application of classifier fusion for protein fold recognition. In: Proceedings of the Sixth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD'09).
Kavousi, K., et al., 2012. Evidence theoretic protein fold classification based on the concept of hyperfold. Math. Biosci. 240 (2), 148–160.
Kouranov, A., et al., 2006. The RCSB PDB information portal for structural genomics. Nucleic Acids Res. 34 (Suppl. 1), D302–D305.
Kuncheva, L.I., 2003. "Fuzzy" versus "nonfuzzy" in combining classifiers designed by boosting. IEEE Trans. Fuzzy Syst. 11 (6), 729–741.
Lee, J., Wu, S., Zhang, Y., 2009. Ab initio protein structure prediction. In: From Protein Structure to Function with Bioinformatics. Springer, pp. 3–25.
Leon, F., Aignatoaiei, B.I., Zaharia, M.H., 2009. Performance analysis of algorithms for protein structure classification. In: Proceedings of the 20th International Workshop on Database and Expert Systems Application (DEXA'09).
Lin, S.-X., Lapointe, J., 2013. Theoretical and experimental biology in one—a symposium in honour of Professor Kuo-Chen Chou's 50th anniversary and Professor Richard Giegé's 40th anniversary of their scientific careers. J. Biomed. Sci. Eng. 6, 435.
Liu, B., et al., 2014. Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics 30 (4), 472–479.
Lyons, J., et al., 2014. Protein fold recognition by alignment of amino acid residues using kernelized dynamic time warping. J. Theor. Biol. 354, 137–145.
Mohri, M., Rostamizadeh, A., Talwalkar, A., 2012. Foundations of Machine Learning. MIT Press.
Moody, J., Darken, C.J., 1989. Fast learning in networks of locally-tuned processing units. Neural Comput. 1 (2), 281–294.


Nanni, L., 2006a. Ensemble of classifiers for protein fold recognition. Neurocomputing 69 (7–9), 850–853.
Nanni, L., 2006b. A novel ensemble of classifiers for protein fold recognition. Neurocomputing 69 (16–18), 2434–2437.
Qiu, W.-R., Xiao, X., Chou, K.-C., 2014. iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components. Int. J. Mol. Sci. 15 (2), 1746–1766.
Rodriguez, J.J., Kuncheva, L.I., Alonso, C.J., 2006. Rotation forest: a new classifier ensemble method. IEEE Trans. Pattern Anal. Mach. Intell. 28 (10), 1619–1630.
Sharma, A., et al., 2013. A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. J. Theor. Biol. 320, 41–46.
Shen, H., Chou, K.-C., 2005. Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo-amino acid composition to predict membrane protein types. Biochem. Biophys. Res. Commun. 334 (1), 288–292.


Shen, H.-B., Chou, K.-C., 2006. Ensemble classifier for protein fold pattern recognition. Bioinformatics 22 (14), 1717–1722.
Shen, H.-B., Chou, K.-C., 2009. Predicting protein fold pattern with functional domain and sequential evolution information. J. Theor. Biol. 256 (3), 441–446.
Shenoy, S.R., Jayaram, B., 2010. Proteins: sequence to structure and function—current status. Curr. Protein Pept. Sci. 11 (7), 498–514.
Shi, S.Y., Suganthan, P.N., Deb, K., 2004. Multiclass protein fold recognition using multiobjective evolutionary algorithms. In: Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB'04).
Xu, Y., et al., 2014. iHyd-PseAAC: predicting hydroxyproline and hydroxylysine in proteins by incorporating dipeptide position-specific propensity into pseudo amino acid composition. Int. J. Mol. Sci. 15 (5), 7594–7610.
Yang, M.Q., et al., 2008. Ensemble voting system for multiclass protein fold recognition. Int. J. Pattern Recognit. Artif. Intell. 22 (4), 747–763.
