Optimal subset selection of primary sequence features using the genetic algorithm for thermophilic proteins identification.

Biotechnol Lett DOI 10.1007/s10529-014-1577-3

ORIGINAL RESEARCH PAPER

LiQiang Wang • CuiFeng Li

Received: 26 March 2014 / Accepted: 28 May 2014 © Springer Science+Business Media Dordrecht 2014

Abstract

A genetic algorithm (GA) coupled with multiple linear regression (MLR) was used to extract useful features from amino acids and g-gap dipeptides for distinguishing between thermophilic and non-thermophilic proteins. The method was trained on a benchmark dataset of 915 thermophilic and 793 non-thermophilic proteins. It reached an overall accuracy of 95.4 % in a Jackknife test using nine amino acids, 38 0-gap dipeptides and 29 1-gap dipeptides. The accuracy as a function of protein size ranged between 85.8 and 96.9 %. The overall accuracies of three independent tests were 93, 93.4 and 91.8 %. These results suggest that the GA-MLR approach described herein should be a powerful method for selecting features that describe thermostable machines and an aid in the design of more stable proteins.

Keywords Feature selection · Genetic algorithm · g-gap dipeptide · Multiple linear regression · Protein thermostability

L. Wang · C. Li (corresponding author)
Department of Biochemistry and Molecular Biology, College of Life Science, Nankai University, Weijin Road 94, Tianjin 300071, China
e-mail: [email protected]
L. Wang e-mail: [email protected]

Introduction

Protein thermostability is an important aspect of protein engineering and biotechnological research (Bommarius et al. 2006). To design stable proteins, numerous investigations have sought to understand the chemical properties that influence thermophilic protein stability. Sadeghi et al. (2006) showed that hydrogen bonds, salt bridges and ion pairs can enhance protein stability. Furthermore, protein stability depends linearly on chain length (Ghosh and Dill 2009) and protein rigidity (Radestock and Gohlke 2008). In addition, hydrophobic, charged and aromatic amino acids are more abundant in thermophilic proteins than in mesophilic proteins (Zhou et al. 2008). Differences in dipeptide composition between thermophilic and mesophilic proteins have also been observed (Zhang and Fang 2006a). To predict a given protein’s thermostability, Zhang and Fang (2006a, 2007) developed several methods based on amino acid composition (AAC) and dipeptide composition (DC). The fivefold cross-validation accuracy of their methods was 82.7–87.4 %. Gromiha and Suresh (2008) removed the redundancy from Zhang and Fang’s dataset and improved the accuracy to 89.4 %. Zuo et al. (2013) proposed a more robust classifier, KNN-ID, and achieved a Jackknife test accuracy of 91.02 %. To further improve the accuracy, Lin and Chen (2011)


used the analysis of variance (ANOVA) technique to select attributes from g-gap dipeptides and obtained a Jackknife test accuracy of 93.27 % using a support vector machine (SVM). Nakariyakul et al. (2012) proposed an improved forward floating selection (IFFS) algorithm to select attributes from the combination of AAC and DC and obtained a Jackknife test accuracy of 93.3 %. Although numerous innovative methods have been proposed, the accuracy of detecting thermophilic proteins still requires improvement.

In this work, a genetic algorithm (GA) coupled with multiple linear regression (MLR) was developed for detecting thermophilic proteins based on features extracted from the 20 amino acids and g-gap dipeptides (g = 0, 1). Based on the 76 selected features, the Jackknife test accuracy of the MLR discriminating functions reached 95.43 %, which is higher than the results obtained by SVM (Lin and Chen 2011; Nakariyakul et al. 2012). We also validated the detection capability of the MLR discriminating functions with tests on different sequence lengths and with independent tests. The significance of the selected features and their functional implications are discussed.

Materials and methods

Datasets

The benchmark dataset is from Lin and Chen (2011) and contains protein sequences from 136 prokaryotic organisms (17 archaea and 119 bacteria). Generally, proteins of psychrophiles are considered psychrophilic, those of mesophiles mesophilic, those of thermophiles thermophilic and those of hyperthermophiles hyperthermophilic. In this work, proteins are grouped as either thermophilic or non-thermophilic. Thermophilic proteins are from organisms that have 60 °C as the lower limit of their optimal growth temperature. Non-thermophilic proteins are from organisms with 30 °C as the upper limit of their optimal growth temperature. Proteins were manually annotated and reviewed. Fragments of proteins and proteins derived from prediction or homology were excluded. Sequences containing ambiguous residues (such as "X", "B" and "Z") were also removed. Furthermore, redundant sequences were removed with the cutoff set at 40 %. The final dataset contained 915 thermophilic proteins and 793 non-thermophilic proteins, and is deemed to be more reliable than the datasets used in previous work (Gromiha and Suresh 2008; Zhang and Fang 2007).

Computation of AAC and g-gap DC

The AAC of each protein with N amino acids was defined as:

$$\mathrm{Comp}(i) = \frac{n_i}{N} \times 100\,\%, \quad 1 \le i \le 20$$

Here i indexes the 20 types of amino acids and n_i is the number of residues of type i. The g-gap DC of each protein with N amino acids was defined as:

$$\mathrm{Comp}^{g}(i,j) = \frac{n^{g}_{ij}}{\sum_{i=1}^{20}\sum_{j=1}^{20} n^{g}_{ij}} \times 100\,\% = \frac{n^{g}_{ij}}{N - g - 1} \times 100\,\%, \quad g = 0, 1, 2, \ldots$$

Here n^g_ij is the number of g-gap dipeptides of type (i, j), of which there are 400 types, and g is the number of residues between residue i and residue j.

Evaluation of the performance

The performance of our method was measured by the sensitivity (Sn), specificity (Sp) and accuracy (Acc), calculated as follows:

$$\mathrm{Sn} = \frac{TP}{TP + FN}, \quad \mathrm{Sp} = \frac{TN}{TN + FP}, \quad \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$

Here, TP, FP, TN and FN refer to the number of true positives (thermophilic proteins identified as thermophilic ones), false positives (non-thermophilic proteins identified as thermophilic ones), true negatives (non-thermophilic proteins identified as non-thermophilic ones) and false negatives (thermophilic proteins identified as non-thermophilic ones), respectively.
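As a rough illustration of the definitions in this section, the following Python sketch computes the AAC, the g-gap DC and the three performance measures for plain sequence strings and 0/1 labels. The function names (`aac`, `gap_dc`, `sn_sp_acc`) and the residue ordering are hypothetical choices made here for illustration; they are not taken from the original MATLAB implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (ordering is arbitrary)


def aac(seq):
    """Amino acid composition: percentage of each of the 20 residue types."""
    n = len(seq)
    return {a: 100.0 * seq.count(a) / n for a in AMINO_ACIDS}


def gap_dc(seq, g):
    """g-gap dipeptide composition: percentage of each of the 400 pairs (i, j)
    whose two residues are separated by g residues.

    Assumes seq holds only the 20 standard residues (ambiguous residues were
    removed from the dataset) and that len(seq) > g + 1.
    """
    total = len(seq) - g - 1  # number of g-gap dipeptides, N - g - 1
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for k in range(total):
        counts[seq[k] + seq[k + g + 1]] += 1
    return {pair: 100.0 * c / total for pair, c in counts.items()}


def sn_sp_acc(y_true, y_pred):
    """Sensitivity, specificity and accuracy, with 1 = thermophilic and
    0 = non-thermophilic. Assumes both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(pairs)
```

For the toy sequence "ACAC", `gap_dc(seq, 0)` counts the three adjacent pairs AC, CA and AC, while `gap_dc(seq, 1)` counts the two pairs AA and CC that skip one residue.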

The process of feature selection

The GA coupled with MLR was used to mine useful information from amino acids and g-gap dipeptides for detecting thermophilic proteins. The 20 amino acids combined with the 400 0-gap dipeptides were analyzed first to select a subset of features with higher accuracy. When the accuracy of the MLR discriminating functions no longer increased, the 1-gap dipeptides containing the selected amino acids were extracted. The extracted 1-gap dipeptides and the features selected in the previous step were then combined for further feature selection (FS). The FS process was terminated when the accuracy did not increase further. The final features may represent the subset with the best possible accuracy. The details of the GA-MLR algorithm are described as follows.

GA imitates the natural evolutionary phenomenon. It is a general-purpose search method that simultaneously explores and exploits the search space (Mahmoudabadi et al. 2009). In this study, the GA toolbox for MATLAB was downloaded from the University of Sheffield, UK (www.acse.dept.shef.ac.uk/cgi-bin/gatbx-download). The GA optimization parameters were a population size of 100, a crossover probability of 0.7 and a mutation probability of 0.01. In each generation of the GA, the length of every chromosome equaled the number of features. Each chromosome, composed of zeros and ones, represented a group of selected features; i.e., one and zero indicated the presence and absence of a feature, respectively. Before the training and testing of the MLR discriminating functions, thermophilic proteins were assigned a value of 1 and non-thermophilic proteins a value of 0. The selected features in each chromosome were then used to train and test the MLR discriminating functions through fivefold cross-validation. After five cycles, the MLR discriminating functions had calculated a score for every protein. According to the score, the class of each protein could be determined; i.e., the class was 1 when the score was ≥ 0.5, and 0 when the score was < 0.5.
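The scoring of a single chromosome described above can be sketched in Python with NumPy as follows; the resulting cross-validated accuracy is the quantity that serves as the chromosome's fitness. This is a minimal sketch under stated assumptions, not the authors' MATLAB code: the function name `mlr_cv_accuracy`, the use of an intercept term, the random fold assignment and the `seed` parameter are all choices made here for illustration.

```python
import numpy as np


def mlr_cv_accuracy(X, y, mask, n_folds=5, seed=0):
    """Cross-validated accuracy of a least-squares linear model that uses
    only the features where mask == 1.

    X: (n_proteins, n_features) array of composition values.
    y: array of 0/1 labels (1 = thermophilic, 0 = non-thermophilic).
    mask: 0/1 chromosome of length n_features.
    """
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0  # a chromosome that selects nothing gets the lowest fitness
    # Intercept column is an assumption; the source does not say whether one was used.
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = np.empty(len(y))
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        # Fit on the training folds, then score the held-out fold.
        coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        scores[test_idx] = A[test_idx] @ coef
    pred = (scores >= 0.5).astype(int)  # class 1 if score >= 0.5, else class 0
    return float(np.mean(pred == y))
```

Every protein is scored exactly once, by the model that did not see it during training, so the returned accuracy covers the whole dataset.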
The overall accuracy was then calculated and taken as the fitness value of each chromosome. Chromosomes with better fitness values were selected, crossed and mutated to form the next generation. After a certain number of generations, the best possible subset of features was obtained. In this study, to accelerate the process of FS, redundant features were removed after every k generations of the GA; i.e., the chromosomes were shortened in each Cycle. Furthermore, the 20 amino acids coupled with the 400 0-gap dipeptides were filtered first. Subsequently, the 1-gap dipeptides containing the selected amino acids were fused into the features selected in the previous step for further FS. Figure 1 shows the simplified flowchart of the GA-MLR algorithm.

Step 0 (Initialization): Randomly initiate the chromosomes according to the number of amino acids and g-gap dipeptides. Calculate the Acc of the initial chromosomes as fitness values.
Step 1 (Executing GA-MLR): Execute GA-MLR for k generations. If the accuracy increased, the algorithm moves to Step 2; otherwise it advances to Step 3.
Step 2 (Accelerating FS): The chromosome with the best accuracy is extracted and shortened by filtering out the positions with a value of 0; i.e., redundant features are removed. The shortened chromosome is regarded as the seed chromosome and the algorithm returns to Step 0 for the next Cycle.
Step 3 (Adding features): If g < 2, the next selected g-gap dipeptides are fused into the selected features and the algorithm returns to Step 0; otherwise it advances to Step 4.
Step 4 (Transformation): Decode the chromosome into features, i.e., the final selected features.

The parameters Gen and Cycle count the GA generations and the number of times redundant features have been removed, respectively; both are initially set to 1. k is the total number of generations in each Cycle and is determined by the number of features in that Cycle: the more features, the larger the number of generations used. Although more generations support a more thorough selection of useful features, the computation time becomes longer. To balance the speed of FS against adequate selection of useful features, k was set within the range [100, 200]. To avoid overfitting the MLR discriminating functions, g takes only the values 0 and 1. We ran all of our experiments using MATLAB 2013b on an Intel Core i7 computer running at 2.4 GHz with 4 GB RAM.
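Steps 0 to 2 above (run the GA for k generations, then shorten the chromosome to the selected features and repeat while the accuracy increases) can be sketched in Python as follows. This is a simplified illustration, not the Sheffield toolbox: binary tournament selection, single-point crossover, elitism and the names `run_ga` and `ga_select` are assumptions made here, and Step 3 (fusing in the next g-gap dipeptides) is left to the caller. The argument `fitness_on` is a hypothetical callable that takes an array of feature indices and returns an accuracy-like score.

```python
import numpy as np


def run_ga(fitness, n_feat, k, rng, pop_size=100, p_cross=0.7, p_mut=0.01,
           seed_chrom=None):
    """Run k generations of a binary GA; return (best_mask, best_fitness)."""
    pop = rng.integers(0, 2, size=(pop_size, n_feat))
    if seed_chrom is not None:
        pop[0] = seed_chrom  # carry the seed chromosome into the new Cycle
    best, best_fit = None, -np.inf
    for _ in range(k):
        fit = np.array([fitness(c) for c in pop], dtype=float)
        i = int(fit.argmax())
        if fit[i] > best_fit:
            best, best_fit = pop[i].copy(), float(fit[i])
        # Binary tournament selection (assumption: the operator is not stated in the source).
        a, b = rng.integers(0, pop_size, size=(2, pop_size))
        parents = pop[np.where(fit[a] >= fit[b], a, b)]
        children = parents.copy()
        for j in range(0, pop_size - 1, 2):  # single-point crossover on pairs
            if n_feat > 1 and rng.random() < p_cross:
                cut = rng.integers(1, n_feat)
                children[j, cut:] = parents[j + 1, cut:]
                children[j + 1, cut:] = parents[j, cut:]
        flips = rng.random(children.shape) < p_mut  # bit-flip mutation
        pop = np.where(flips, 1 - children, children)
        pop[0] = best  # elitism (assumption): keep the best chromosome so far
    return best, best_fit


def ga_select(fitness_on, n_feat, k=100, max_cycles=10, seed=0, **ga_kw):
    """Repeat GA runs, dropping unselected features after each Cycle, until
    the best fitness stops increasing. Returns (feature_indices, fitness)."""
    rng = np.random.default_rng(seed)
    active = np.arange(n_feat)  # original indices of features still in play
    best_fit, seed_chrom = -np.inf, None
    for _ in range(max_cycles):
        mask, fit = run_ga(lambda c: fitness_on(active[c == 1]),
                           len(active), k, rng, seed_chrom=seed_chrom, **ga_kw)
        if fit <= best_fit:
            break  # accuracy did not increase: stop shortening
        best_fit = fit
        active = active[mask == 1]  # Step 2: remove features coded as 0
        if len(active) == 0:
            break
        seed_chrom = np.ones(len(active), dtype=int)
    return active, best_fit
```

In use, `fitness_on` would wrap a cross-validated MLR accuracy computed on the chosen feature columns. Because the all-ones seed chromosome reproduces the previous Cycle's best subset, the best fitness cannot decrease from one Cycle to the next.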

Results and discussion

Feature selection results

The multiple linear regression (MLR) discriminating functions were evaluated by fivefold, tenfold and Jackknife cross-validation tests. The results are


Fig. 1 Schematic representation of the genetic algorithm coupled with multiple linear regression (MLR) for feature selection. [Flowchart; recoverable box labels: Start; Generate initial population; Train and test the MLR by fivefold cross-validation for each chromosome; Gen = Gen + 1; Gen = k?; Reproduce population; Select the fittest chromosome and remove redundant features; Cycle = Cycle + 1; Acc increase?; Fuse the g-gap dipeptides into the selected features as a new subset. The remaining labels are truncated in the source.]
