Computational Biology and Chemistry 56 (2015) 49–60


Research Article

Genetic Bee Colony (GBC) algorithm: A new gene selection method for microarray cancer classification

Hala M. Alshamlan a,∗, Ghada H. Badr a,b, Yousef A. Alohali a

a Computer Science Department, King Saud University, Riyadh, Saudi Arabia
b IRI - The City of Scientific Research and Technological Applications, Alexandria, Egypt

Article info

Article history: Received 6 November 2014; received in revised form 15 March 2015; accepted 15 March 2015; available online 18 March 2015.

Keywords: Microarray; Gene selection; Feature selection; Cancer classification; Gene expression profile; Filter method; Artificial Bee Colony; ABC; mRMR

Abstract

Naturally inspired evolutionary algorithms have proven effective for solving feature selection and classification problems. Artificial Bee Colony (ABC) is a relatively new swarm intelligence method. In this paper, we propose a new hybrid gene selection method, namely the Genetic Bee Colony (GBC) algorithm. The proposed algorithm combines the use of a Genetic Algorithm (GA) with the Artificial Bee Colony (ABC) algorithm. The goal is to integrate the advantages of both algorithms. The proposed algorithm is applied to microarray gene expression profiles in order to select the most predictive and informative genes for cancer classification. In order to test the accuracy performance of the proposed algorithm, extensive experiments were conducted. Three binary microarray datasets are used: colon, leukemia, and lung. In addition, three multi-class microarray datasets are used: SRBCT, lymphoma, and leukemia. Results of the GBC algorithm are compared with our recently proposed technique: mRMR combined with the Artificial Bee Colony algorithm (mRMR-ABC). We also compared the combination of mRMR with GA (mRMR-GA) and with Particle Swarm Optimization (mRMR-PSO). In addition, we compared the GBC algorithm with other related algorithms recently published in the literature, using all benchmark datasets. The GBC algorithm shows superior performance, achieving the highest classification accuracy along with the lowest average number of selected genes. This demonstrates that the GBC algorithm is a promising approach for solving the gene selection problem in both binary and multi-class cancer classification.

© 2015 Elsevier Ltd. All rights reserved.

1. Background

DNA microarray technology has had a tremendous impact on cancer research. Microarray gene expression data have been widely used to identify cancer biomarkers or gene signatures, which could complement conventional histopathologic evaluation to increase the accuracy of cancer diagnosis and classification (Simon, 2009). They could also improve our understanding of the causes of cancer and aid the discovery of new therapies (Alba et al., 2007). However, there are two important issues in microarray classification. First, the datasets are usually complex and noisy. Second, currently available datasets typically contain fewer than one hundred instances, though each instance quantifies the expression levels of several thousands of genes (i.e., high dimensionality). Due to the high dimensionality and small sample size of the experimental data, traditional classification methods cannot

∗ Corresponding author. Tel.: +966 504427599. E-mail addresses: [email protected] (H.M. Alshamlan), [email protected], [email protected] (G.H. Badr), [email protected] (Y.A. Alohali). http://dx.doi.org/10.1016/j.compbiolchem.2015.03.001 1476-9271/© 2015 Elsevier Ltd. All rights reserved.

be effectively applied to gene expression classification (Alba et al., 2007), because their classification accuracy is quite poor. Therefore, feature construction and gene selection have been applied to gene expression data in order to overcome the high-dimensionality problem (Ghorai et al., 2010; Sheng-Bo et al., 2006). Many supervised machine learning algorithms, such as neural networks, Bayesian networks, and support vector machines (SVMs), combined with gene selection techniques, have previously been applied to microarray gene expression (Alshamlan et al., 2014). Gene selection is the process of selecting the smallest subset of informative genes that are most predictive of the target class using a classification model. This maximizes the classifier's ability to classify samples accurately. The optimal feature selection problem has been shown to be NP-hard (Narendra and Fukunaga, 1977). Therefore, it is more effective to use heuristic approaches, such as naturally inspired evolutionary algorithms, to solve this problem. The Artificial Bee Colony (ABC) algorithm, introduced by Karaboga (2005), is one bio-inspired evolutionary approach that has been used to find optimal solutions in numerical optimization problems. The algorithm is inspired by the behavior of honeybees when seeking a quality food source. The performance of ABC algorithms has been



compared with that of other evolutionary methods such as the genetic algorithm (GA), differential evolution (DE), evolution strategies (ES), particle swarm optimization, and the particle swarm-inspired evolutionary algorithm (PS-EA) (Karaboga and Akay, 2009; Karaboga and Basturk, 2007, 2008). Numerical comparison results showed that the ABC algorithm is competitive. Due to its simplicity and ease of implementation, the ABC algorithm has captured much attention and has been applied to solve many practical optimization problems. Therefore, in this paper, we propose the application of the ABC algorithm to select predictive and informative genes from microarray gene expression profiles. Naturally inspired evolutionary algorithms such as GA, PSO, and ABC are more applicable and accurate than the wrapper gene selection method (Alshamlan et al., 2014) because they are capable of searching for optimal or near-optimal solutions in complex and large spaces of possible solutions. Furthermore, they allow searching the solution space by considering multiple interacting attributes simultaneously, rather than one attribute at a time (Alshamlan et al., 2014). However, similar to other evolutionary algorithms, the ABC algorithm also faces some challenging problems, especially in computational efficiency, when applied to complex and high-dimensional data such as microarray datasets. In our previous algorithm, mRMR-ABC (Alshamlan et al., in press), in order to solve these problems and further improve the performance of the Artificial Bee Colony (ABC) algorithm, we adopted a filtering method, minimum redundancy maximum relevance (mRMR), as a preprocessing step to reduce the dimensionality of microarray datasets. For all metaheuristic population-based optimization algorithms, balance between exploitation and exploration is the determining factor for success (Jatoth and Rajasekhar, 2010).
If that balance is not achieved, the algorithm can be prematurely trapped in local optima (by too much exploitation), as in GAs, or it can fail to converge (by too much exploration), as in the ABC algorithm. In order to make the most of the advantages of naturally inspired metaheuristic evolutionary algorithms and to eliminate their disadvantages, such as premature convergence and long computational time, hybridization is performed. This paper presents a new hybrid algorithm combining the advantages of the GA and the ABC algorithm, proposed for microarray gene selection and classification problems. The proposed algorithm, called the Genetic Bee Colony (GBC) algorithm, was inspired by our previous algorithm, mRMR-ABC (Alshamlan et al., in press). In the original ABC algorithm (Karaboga, 2005), exploitation is performed by employed bees and onlooker bees, while scout bees perform exploration. However, the real situation is more complicated: some exploration is actually done by the onlooker bees, and the way in which a new candidate solution is generated is of crucial importance (Milan, 2013). To combine the exploration and exploitation capabilities of the GA and the ABC algorithm, in our proposed algorithm we adopted four modifications of the original ABC algorithm proposed by Karaboga (2005). First, to reduce the computational time and cost of the ABC algorithm, and because microarray datasets suffer from high dimensionality, we preprocessed the microarray dataset using the mRMR filter method. One of the main functions of employee bees in the ABC algorithm is sharing information about the position of a food source (solution). This function is known as information sharing. Unfortunately, most proposed ABC models do not sufficiently consider this important function, leading to ineffective optimization results (Kıran and Gündüz, 2012).
In the original ABC algorithm, at the onlooker bee phase, the higher-quality results obtained by employee bees have a higher probability of selection (Karaboga, 2005; Kıran and Gündüz, 2012). However, the selection of neighbour employed bees and the parameters for new candidate solutions are random. Therefore, in order to benefit from the information about the position of food sources (solutions) shared by the employee bees, we improved the ABC algorithm by adding a uniform crossover operation as a new step in the onlooker phase. The selection of neighbour bees for onlooker bees is conducted according to the quality of the solutions found by the employee bees. In this step, the best food source, representing the candidate solution with the highest fitness value and denoted as the Queen Bee in our algorithm, and a random neighbour bee obtained using the probability values of the employee bees are subjected to a uniform crossover operation. As a result of this operation, the best offspring to be considered as neighbours for the onlooker bees are obtained. Subsequently, we increased the number of scout bees to two, rather than the one scout bee of the basic ABC algorithm, to improve the movement speed by increasing the replacement rate. Finally, in order to achieve a balance between the exploitation and exploration capabilities of the ABC algorithm and to improve its local search and exploitation abilities, we adopted mutation operators from the GA during the replacement of the exhausted solution in the scout bee phase. In this paper, the efficiency of gene selection techniques was measured using an SVM as the classifier. The SVM has proven more beneficial than other classification approaches (Alshamlan et al., 2014). It is challenging to construct a linear classifier to separate the classes of data. SVMs address this problem by mapping the input space into a high-dimensional feature space; they then construct a linear decision function to classify the input data with a maximum-margin hyperplane. SVMs have also been found to be more effective and faster than other machine learning methods, such as neural networks and k-nearest neighbour classifiers (Wang and Gotoh, 2009).
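To make the role of the classifier concrete, the sketch below shows the kind of fitness evaluation used throughout this paper: the classification accuracy obtained on a candidate gene subset. The paper uses an SVM; here a simple nearest-centroid classifier stands in so the sketch stays self-contained, and all names (`fitness`, `expression`, `labels`) are illustrative rather than taken from the authors' code.

```python
# Sketch of a gene-subset fitness evaluation. A nearest-centroid classifier
# stands in for the SVM used in the paper; names are illustrative.
import numpy as np

def nearest_centroid_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (SVM stand-in)."""
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        Xtr, ytr = X[mask], y[mask]
        centroids = {c: Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)}
        pred = min(centroids, key=lambda c: np.linalg.norm(X[i] - centroids[c]))
        correct += pred == y[i]
    return correct / len(y)

def fitness(expression, labels, subset):
    """Fitness of a solution = accuracy on the selected gene columns."""
    return nearest_centroid_accuracy(expression[:, subset], labels)

# toy data: 20 samples x 50 genes, with gene 3 made informative
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
y = np.array([0] * 10 + [1] * 10)
X[y == 1, 3] += 5.0
print(fitness(X, y, [3]))   # the informative gene should score near 1.0
```

In the actual algorithm, each candidate gene subset (food source) would be scored this way, with the SVM's cross-validated accuracy serving as the nectar amount.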
In the literature, there are several algorithms for gene selection and cancer classification using microarrays. However, to our knowledge, this is the first attempt to hybridize genetic and ABC algorithms as a gene selection method for cancer classification problems using microarray gene expression profiling. In addition, we believe that microarray-based multi-class molecular analysis can be an effective tool for cancer biomarker discovery and subsequent molecular cancer diagnosis and treatment. Therefore, using our proposed algorithm, we investigated the multi-class classification of cancer microarray datasets. In contrast to the classification of binary microarray gene expression data with two cancer types, multi-class classification of more than two cancer types is a relatively difficult and less-studied problem. Extensive experiments were conducted in order to evaluate the performance of the proposed algorithm using three binary microarray datasets: colon, leukemia, and lung. Our algorithm was also applied to classify three multi-class microarray datasets: SRBCT, lymphoma, and leukemia. Results of the GBC algorithm were compared with our recently proposed technique: mRMR combined with the Artificial Bee Colony algorithm (mRMR-ABC). We also compared mRMR combined with GA (mRMR-GA) and mRMR combined with Particle Swarm Optimization (mRMR-PSO). In addition, we compared the GBC algorithm with other related algorithms that have been recently published in the literature. When tested on all benchmark datasets, the GBC algorithm showed superior performance, achieving the highest classification accuracy along with the lowest average number of selected genes. This demonstrates that the GBC algorithm is a promising approach for solving the gene selection problem in binary and multi-class cancer classification. The remainder of this paper is organized as follows: The proposed Genetic Bee Colony (GBC) algorithm is explained in Section 2.
Section 3 outlines the experimental setup and provides results. Finally, Section 4 concludes our paper.


2. Genetic Bee Colony (GBC) algorithm

In this section, we introduce the proposed Genetic Bee Colony (GBC) algorithm for the selection of predictive genes from cancer microarray gene expression profiles. GBC is a new hybrid metaheuristic algorithm based on two naturally inspired algorithms: ABC and GA. The aim of our proposed algorithm is to select the most informative genes in order to optimize the accuracy of the SVM classifier. The goal of any metaheuristic algorithm is to find the optimal feasible solution. To achieve this goal, an appropriate balance between exploitation and exploration is required. In the original ABC algorithm, the exploration process for finding a new solution in the optimization search space is good, but solution exploitation is poor and a long computational time is required to converge and find the optimal solution. While GA has good exploitation operations (crossover and mutation), it does not have the ability to efficiently explore the optimization search space (Kıran and Gündüz, 2012; Jatoth and Rajasekhar, 2010; Milan, 2013). That is why GAs suffer from the premature convergence problem and reach local optima too quickly. Therefore, in order to achieve balanced exploitation and exploration, realise the advantages of naturally inspired metaheuristic evolutionary algorithms, and eliminate their disadvantages, such as premature convergence and long computational time, our proposed GBC algorithm uses hybridization. In our algorithm, we integrate GA operators with the ABC algorithm, deriving a modified ABC-based algorithm for constrained optimization. In our modified algorithm, GA operators are adopted in the exploitation process: in the onlooker bee phase, to improve information sharing between employee bees and onlooker bees and so find optimal solutions, and in the scout bee phase, to enhance the process of replacing the exhausted solution. As illustrated in Fig. 1, our proposed algorithm consists of five phases: the preprocessing phase, the representation and initialization phase, the employee bee phase, the onlooker bee phase, and the scout bee phase. In the following subsections, we describe each phase and how the algorithm is applied to gene selection and cancer classification using microarray datasets.

2.1. The preprocessing phase

In high-dimensional microarray datasets, due to the existence of several thousands of genes, it is inefficient to apply an evolutionary algorithm such as ABC directly to the microarray dataset.


In addition, it is difficult and infeasible for a classifier to be trained accurately. Alternative methods should be employed to overcome this difficulty. Therefore, as a first step, mRMR is employed to filter noisy and redundant genes. The mRMR approach, a heuristic framework to minimise redundancy, was proposed by Peng et al. (2005). It uses a series of intuitive measures of relevance and redundancy to select promising features for both continuous and discrete datasets. mRMR is a criterion for first-order incremental feature selection, which has been extensively studied in the literature (Javad et al., 2012). In our problem, genes that have both minimum redundancy with respect to the input genes and maximum relevancy to the cancer classes should be selected when using the mRMR method. Thus, the mRMR method is based on two main metrics: the first is the mutual information between cancer classes and each gene, which measures relevancy; the second is the mutual information between every two genes, which is used to compute redundancy. Fig. 2 presents the mRMR dataset, which contains the indices of the ordered selected genes. The first row represents the maximally relevant and minimally redundant genes. Let S denote the set of selected genes, and let Rl be the measure of the relevancy of a group of selected genes S, defined as follows:

$Rl = \frac{1}{|S|} \sum_{G_x \in S} I(G_x, C),$    (1)

where $I(G_x, C)$ represents the value of mutual information between an individual gene $G_x$ that belongs to S and the cancer class $C = \{c_1, c_2\}$, where $c_1$ and $c_2$ denote the normal and tumor classes. When the selected genes have the maximum relevance value Rl, it is possible to have high dependency (i.e., redundancy) between these genes. Hence, the redundancy Rd of a group of selected genes S is defined as:

$Rd = \frac{1}{|S|^2} \sum_{G_x, G_y \in S} I(G_x, G_y)$    (2)

where $I(G_x, G_y)$ is the mutual information between the xth and yth genes, which measures the mutual dependency of these two genes. The main purpose of applying the mRMR gene selection method is to find a subset of m genes, $\{x_i\}$, from S that either jointly have the largest dependency on the target class c or have minimal redundancy within the selected gene subset S.

Fig. 1. The main phases of the Genetic Bee Colony (GBC) algorithm.



Fig. 2. The mRMR dataset, which contains the gene indices selected by the mRMR filter approach, ordered by their relevancy.

Peng et al. (2005) recommended searching for balanced solutions through a composite objective that combines the maximal relevance criterion and the minimal redundancy criterion as follows:

$\max(Rl, Rd) = Rl - Rd.$    (3)

Our goal is to maximize the prediction accuracy and minimize the number of selected genes. Hence, we applied the mRMR method as a preprocessing step in the proposed GBC algorithm to improve the speed and performance of the search. The initial microarray dataset is preprocessed using the mRMR filtering approach. Each gene is evaluated and sorted according to the mRMR criterion. The top-ranked genes that provide 100% classification accuracy with the SVM classifier are selected to form a new subset called the mRMR dataset, as shown in Fig. 2. The mRMR dataset thus contains the most relevant and least redundant genes selected by the mRMR approach, which is applied in order to filter unimportant and noisy genes and reduce the computational load for the ABC algorithm and the SVM classifier.
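The mRMR criterion of Eqs. (1)-(3) can be sketched as a greedy ranking: pick the most relevant gene first, then repeatedly add the gene maximizing relevance minus average redundancy against the genes already selected. This is a minimal illustration, not the authors' implementation; mutual information is estimated on discretized values via a histogram, and all names are assumptions.

```python
# Minimal mRMR sketch (Eqs. (1)-(3)): greedy max(relevance - redundancy)
# ranking with histogram-based mutual information. Illustrative only.
import numpy as np

def mutual_info(a, b, bins=5):
    """Histogram estimate of I(a; b) in nats for two (discretized) vectors."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    pa, pb = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (pa @ pb)[nz])).sum())

def mrmr_rank(X, y, k):
    """Return indices of k genes ranked greedily by relevance - redundancy."""
    n_genes = X.shape[1]
    relevance = np.array([mutual_info(X[:, j], y) for j in range(n_genes)])
    selected = [int(np.argmax(relevance))]          # most relevant gene first
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_genes):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy        # composite objective, Eq. (3)
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

# toy data: 40 samples x 30 genes; gene 4 carries the class signal
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 30))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, 4] += 3.0
print(mrmr_rank(X, y, 3)[0])   # the shifted gene (index 4) should rank first
```

In the paper's pipeline, the genes ranked this way form the mRMR dataset that the GBC algorithm then searches.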

2.2. The representation and initialization phase

In this paper, we made some modifications to the original ABC algorithm representation to match the microarray gene selection problem. The representation of the solution space (foods) for the proposed GBC algorithm when applied to a microarray dataset is illustrated in Fig. 3. The GBC algorithm first generates a random initial population of size SN, where SN denotes the total number of food sources (solutions). When applying the

Fig. 3. The representation of food sources for the proposed GBC algorithm. Food sources are the population of solutions (possible gene groups). Each row of the Foods matrix is a particular solution holding D gene indices to be optimized. The number of rows of the Foods matrix equals the food number SN.



Fig. 4. Uniform crossover operation.

GBC algorithm to gene selection for microarray data analysis, as illustrated in Fig. 3, each solution is represented as a group of gene indices selected from the mRMR dataset. This is denoted as $x_{ij}$, where i represents a particular solution (i = 1, 2, ..., SN), and each solution is a D-dimensional vector (j = 1, 2, ..., D), where D represents the number of informative genes to be optimized in each solution. Each cell $x_{ij}$ holds the corresponding gene index. In the gene selection problem, each solution (i.e., subset of selected genes) is associated with a fitness value, which is the classification accuracy using the SVM classifier. The amount of nectar in a food source corresponds to the fitness value of the associated solution in the ABC algorithm. The number of employed bees or onlooker bees is equal to the number of solutions in the population. Each cell $x_{ij}$ is randomly initiated using the following equation (Xiang and An, 2013):

$x_{ij} = L_j + rand(0, 1) \times (U_j - L_j),$    (4)

where $U_j$ and $L_j$ are the upper and lower limits of the variable $x_i$, respectively, with $U_j$ = (maximum gene index − 1) and $L_j$ = 0, and rand(0, 1) is a uniformly distributed random number in (0, 1). When a new gene index is identified, its fitness must be calculated using the fitness function. In our problem, the fitness value $fit_i$ is determined according to the solution's classification accuracy using the SVM classifier. If the new fitness value is better than the previously obtained fitness value, the bee leaves the old solution (food source) and moves to the new one; otherwise, it retains the old solution. After initialization of the random solutions (population), the ABC algorithm starts searching for the optimal solution. In the GBC algorithm, each cycle of the search runs through the remaining three phases until the requirement is met or the maximum number of cycles is reached. Then, the best predictive gene subset, representing the highest-fitness solution, is returned; in our algorithm, this is denoted as the Queen Bee.

2.3. The employee bee phase

In this phase, we send the employee bees into candidate solutions (food sources) and evaluate their fitness (nectar amounts)

using SVM classification accuracy. Thus, the employee bees searching around the solutions (food sources) at $x_i$ will search for better gene indices at the new location $v_i$. The new gene index is identified by the following equation (Xiang and An, 2013):

$v_{ij} = x_{ij} + R_{ij}(x_{ij} - x_{kj}),$    (5)

where $v_i = [v_{i1}, v_{i2}, \ldots, v_{in}]$ represents the new gene indices (the location vector of the bee), $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]$ is the current set of gene indices (the location vector of the ith bee), $k$ ($k \neq i$) is a randomly chosen index in [1, SN], and SN is the number of solutions (artificial bees). $R_{ij}$ is a random number uniformly distributed in [−1, 1]. The random $x_{ij}$ values are selected from the microarray gene indices using Eq. (4).

2.4. The onlooker bee phase

In this study, the crossover operation is used for information sharing between employee and onlooker bees in the optimization search space (hive). The onlooker bees learn the location of a solution (food source) by watching the waggle dance of the employee bees. In our proposed algorithm, the onlooker bees use the location of the best food source, which is the solution with the highest fitness value, denoted as the Queen Bee in our algorithm. It is worth mentioning that in the original ABC algorithm, the onlooker bees do not use this information for neighbour selection: each onlooker bee randomly selects an employed bee as a neighbour. In our algorithm, we propose a uniform crossover operation-based model for an onlooker bee's selection of a neighbour. In our model, the Queen Bee and the randomly selected neighbour solution obtained by Eq. (5), which is the neighbour of the employee bee selected according to its property (the fitness rating of the employee bee), are subjected to a uniform crossover operation. The randomly selected neighbour solution depends on its winning probability value, similar to roulette wheel selection in the GA: the probability $p_i$ of a particular solution (food source) being selected by the onlooker bees is calculated using the following equation:

$p_i = \frac{fit_i}{\sum_{j=1}^{SN} fit_j}.$    (6)
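The three core update rules above, random initialization (Eq. (4)), the neighbour-based position update (Eq. (5)), and fitness-proportional selection (Eq. (6)), can be sketched as follows. All names are illustrative, and the clipping used to keep updated positions inside the valid gene-index range is an assumption (the paper does not specify its boundary handling).

```python
# Minimal sketch of the ABC update rules: Eq. (4) initialization,
# Eq. (5) neighbour update, and Eq. (6) roulette-wheel selection.
import numpy as np

rng = np.random.default_rng(0)
SN, D, MAX_GENE = 10, 5, 500          # colony size, genes per solution, gene pool

def init_foods():
    """Eq. (4): x_ij = L_j + rand(0,1) * (U_j - L_j), with L = 0, U = MAX_GENE - 1."""
    return (rng.random((SN, D)) * (MAX_GENE - 1)).astype(int)

def neighbour_update(foods, i):
    """Eq. (5): v_ij = x_ij + R_ij (x_ij - x_kj), R_ij ~ U[-1, 1], k != i."""
    k = rng.choice([s for s in range(SN) if s != i])
    R = rng.uniform(-1.0, 1.0, size=D)
    v = foods[i] + R * (foods[i] - foods[k])
    return np.clip(v, 0, MAX_GENE - 1).astype(int)   # boundary handling (assumed)

def roulette_pick(fitnesses):
    """Eq. (6): select index i with probability fit_i / sum_j fit_j."""
    p = np.asarray(fitnesses, dtype=float)
    return int(rng.choice(SN, p=p / p.sum()))

foods = init_foods()
v = neighbour_update(foods, 0)
i = roulette_pick(rng.random(SN) + 0.1)   # dummy fitness values
print(foods.shape, v.shape, 0 <= i < SN)
```

In the GBC algorithm, the fitness values fed to the roulette selection would be SVM classification accuracies rather than the dummy values used here.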



Uniform crossover works by treating each gene independently and making a random choice as to which parent it should be inherited from (Eiben and Smith, 2003). This is implemented by randomly generating a bit string of the same length as the parents. Thus, D random variables from a uniform distribution over [0, 1] are created. We set the Crossover Probability Rate (CPR) to 0.6; this is a control parameter of our proposed algorithm that we did not tune further, since the algorithm is not very sensitive to changes in it. In each position, if the value of the random string is below the CPR, Offspring1 takes the gene index from Parent1 and Offspring2 takes the gene index from Parent2; otherwise, Offspring1 takes the gene index from Parent2 and Offspring2 takes the gene index from Parent1. The uniform crossover operation is illustrated in Fig. 4. Subsequently, the best offspring obtained as a result of this operation is considered the suggested solution. It is worth mentioning that in the onlooker phase (exploitation) of the original ABC algorithm, a new solution is generated from the current solution and one random solution, so that the new solution lies anywhere inside the large search space, maximizing diversity. In our algorithm, however, only points or solutions near the best-fitness solution (Queen Bee) are permitted as new solutions. Thus, as may be expected, we reduce the diversity by reducing the space in which a newly generated solution can lie.

2.5. The scout bee phase

In the original ABC algorithm, if the fitness value associated with a solution is not improved for a specified number of trials, the employee bee becomes a scout, to which a random value is assigned for finding a new solution. Notably, this is a mechanism for achieving pure exploration in the optimization search space, pulling out a solution that may be trapped in a local optimum, which is why its value is not improving.
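The uniform crossover step of the onlooker phase can be sketched as below: one random draw per position decides which parent each offspring inherits from, with the CPR as the threshold. Names here are illustrative, not from the authors' code.

```python
# Sketch of uniform crossover between the Queen Bee and a selected neighbour.
# A per-position random draw against the CPR decides the inheritance.
import numpy as np

def uniform_crossover(parent1, parent2, rng, cpr=0.6):
    """Return two complementary offspring built position-by-position."""
    from_p1 = rng.random(len(parent1)) < cpr   # below CPR: offspring1 <- parent1
    off1 = np.where(from_p1, parent1, parent2)
    off2 = np.where(from_p1, parent2, parent1)
    return off1, off2

rng = np.random.default_rng(0)
queen = np.array([3, 17, 42, 8, 99])        # best solution (Queen Bee)
neighbour = np.array([5, 61, 7, 20, 2])     # randomly selected neighbour
off1, off2 = uniform_crossover(queen, neighbour, rng)
# every position of each offspring comes from exactly one of the two parents
assert all(int(a) in (int(b), int(c)) for a, b, c in zip(off1, queen, neighbour))
```

The better of the two offspring (by SVM accuracy) would then serve as the onlooker bee's suggested solution.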
By further studying the ABC algorithm (Nebojsa and Milan, 2012), we noticed a deficiency during the solution search process. After a significant number of iterations, when the optimal solution is almost found, the scout bees, which perform the exploration process, are no longer useful. This problem can be treated by better adjustment of the exploration/exploitation balance (Zhu and Kwong, 2010). In order to achieve balance between the exploitation and exploration capabilities of the ABC algorithm and to improve its local search and exploitation abilities, in our proposed algorithm we enhanced the scout bee movement of the basic ABC algorithm by applying two modifications. First, we increased the number of scout bees from one to two to improve the movement speed by increasing the replacement rate. Second, we adopted mutation operators from the GA during the process of replacing the exhausted solution, to improve the exploitation process in the later stages of the algorithm. In our algorithm, the first scout bee works as proposed in the basic ABC algorithm: it resets the searching process and randomly explores new areas of the solution space. A mutation operator is then applied in the placement of the second scout bee. For the second scout bee, we look for a place around the highest-fitness solution generated so far, which is denoted as the Queen Bee. Thus, each parameter (gene index) in the Queen Bee is mutated with small probability according to Eq. (7). The Mutation Probability Rate (MPR) is set to 0.01; this is also a control parameter of our proposed algorithm that we did not tune further, since the algorithm is not very sensitive to changes in it. The mutation operation is illustrated in Fig. 5. The mutation process is performed only if the random variable is below the MPR value.

$ScoutB_{ij} = QueenB_{ij} + R_{ij}(RandB_{ij} - QueenB_{ij}),$    (7)

where QueenB is the best solution, i is the index of the ith solution, the mutation process is applied to all genes j of that solution (j between 1 and D), and RandB is a randomly selected solution obtained

Fig. 5. Mutation operation.

using Eq. (5), and $R_{ij}$ is a random number uniformly distributed in [−1, 1].

The pseudocode for the proposed GBC algorithm is presented in Algorithm 1.

Algorithm 1. Genetic Bee Colony (GBC) algorithm

1: PreProcessing Phase
2: Preprocess the microarray dataset using the mRMR filter method.
3: Generate the new filtered microarray dataset (mRMR dataset), shown in Fig. 2.
4: Representation and Initialization Phase
5: Represent the solution space of the GBC algorithm for the microarray dataset, shown in Fig. 3.
6: For each employee bee
7:   Generate an initial solution using Eq. (4).
8:   Calculate the fitness value using SVM classification accuracy.
9:   Reset the abandonment counter.
10: Do
11:   Employee Bee Phase
12:   For each employee bee
13:     Select a neighbour employee bee randomly.
14:     Update the position of the employee bee (New Solution) using Eq. (5).
15:     Calculate the fitness value of the New Solution using SVM classification accuracy.
16:     IF Fitness of New Solution > Fitness of Old Solution THEN
17:       Replace the old solution with the new one and reset its abandonment counter.
18:     ELSE
19:       Increase the abandonment counter of the old solution by 1.
20:   (Determine Queen Bee) Find the best solution obtained so far, which has the highest fitness value (Queen Bee).
21:   Onlooker Bee Phase
22:   For each onlooker bee
23:     Select an employee bee according to its probability of being chosen (fitness of the employee bee).
24:     Select a new random neighbour solution by updating the position using Eq. (5).
25:     (Uniform Crossover) For each gene index j in candidate solution i (i.e., for j = 0 to D)
26:       Select a random bit RN in [0, 1].
27:       IF RN > 0.6 THEN
28:         Offspring1_ij = QueenBee_ij
29:         Offspring2_ij = RandomNeighbor_ij
30:       ELSE
31:         Offspring1_ij = RandomNeighbor_ij
32:         Offspring2_ij = QueenBee_ij
33:     Calculate the fitness values of Offspring1 and Offspring2 using SVM classification accuracy.
34:     IF Fitness of Offspring1 > Fitness of Offspring2 THEN
35:       NewSolution = Offspring1
36:     ELSE
37:       NewSolution = Offspring2
38:     IF Fitness of New Solution > Fitness of Old Solution THEN
39:       Replace the old solution with the new one and reset its abandonment counter.
40:     ELSE
41:       Increase the abandonment counter of the old solution by 1.
42:   (Determine Queen Bee) Find the best solution obtained so far, which has the highest fitness value (Queen Bee).
43:   Scout Bee Phase
44:   Set the abandonment limit L to 5.
45:   Search for abandoned bees.
46:   First Scout Bee
47:   IF the abandonment counter of a bee > L THEN
48:     Reset the abandonment counter of the bee.
49:     Generate a new solution for the employee bee randomly.
50:   Second Scout Bee
51:   IF the abandonment counter of a bee > L THEN
52:     Reset the abandonment counter of the bee.
53:     Generate a new solution by mutating the Queen Bee using Eq. (7).
54: Until the termination condition is met
55: Return the predictive and informative genes.
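The second scout bee's mutation step (Eq. (7)) can be sketched as follows: each gene index of the Queen Bee is perturbed toward a randomly selected solution with small probability MPR. Names are illustrative, and boundary handling for the resulting indices is omitted for brevity.

```python
# Sketch of the second scout bee's mutation (Eq. (7)):
# ScoutB_ij = QueenB_ij + R_ij * (RandB_ij - QueenB_ij), applied with prob. MPR.
import numpy as np

def mutate_queen(queen, rand_solution, rng, mpr=0.01):
    """Mutate each gene index of the Queen Bee with probability MPR."""
    scout = queen.astype(float).copy()
    for j in range(len(queen)):
        if rng.random() < mpr:                 # mutate only below the MPR
            r = rng.uniform(-1.0, 1.0)         # R_ij in [-1, 1]
            scout[j] = queen[j] + r * (rand_solution[j] - queen[j])
    return scout.astype(int)

rng = np.random.default_rng(0)
queen = np.array([3, 17, 42, 8, 99])
rand_sol = np.array([50, 2, 11, 70, 5])
scout = mutate_queen(queen, rand_sol, rng, mpr=0.5)   # larger MPR to show effect
print(scout.shape)
```

With the paper's MPR of 0.01, most positions stay identical to the Queen Bee, so the second scout bee searches a small neighbourhood of the best solution rather than the whole space.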

3. Experimental setup and results

3.1. Experimental setup

A microarray dataset is commonly represented as an N × M matrix, where N is the number of experimental samples and M is the number of genes involved in the experiments. Each cell in the matrix is the expression level of a specific gene in a specific experiment. In this section, we evaluate the overall performance of gene selection methods using six popular binary and multi-class microarray cancer datasets, which were downloaded from http://www.gems-system.org. These datasets have been widely used to benchmark the performance of gene selection methods in the bioinformatics field. The binary-class microarray datasets are: Colon (Alon et al., 1999), Leukemia (Golub et al., 1999), and Lung (Beer et al., 2002). The multi-class microarray datasets are: SRBCT (Khan et al., 2001), Lymphoma (Alizadeh et al., 2000), and Leukemia (Armstrong et al., 2001). In Table 1, we present a detailed description of these six benchmark microarray gene expression datasets with respect to the number of classes, number of samples, and number of genes.

Binary class microarray datasets. The first binary class microarray dataset was obtained from cancerous and normal colon tissues. Among the samples, 40 are from tumors and 22 are from healthy parts of the colons of the same patients (Alon et al., 1999). The second dataset was obtained from cancer patients with two different types of leukemia: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The complete dataset contains 25 AML and 47 ALL samples (Golub et al., 1999). The last binary class dataset is the lung cancer microarray dataset (Beer et al., 2002), which includes 86 primary lung adenocarcinoma samples and 10 non-neoplastic lung samples. Each sample is described by 7129 genes.

Multi-class microarray datasets. In our experiment, the small round blue cell tumors (SRBCTs), which comprise 4 different


Table 2
GBC control parameters.

Parameter                           Value
Colony size                         80
Max cycle                           100
Number of runs                      30
Limit                               5
Crossover Probability Rate (CPR)    0.8
Mutation Probability Rate (MPR)     0.01

childhood tumors, were used; they are so named because of their similar appearance on routine histology, which makes correct clinical diagnosis extremely challenging. However, accurate diagnosis is essential because the treatment options, responses to therapy, and prognoses vary widely depending on the diagnosis. The SRBCT dataset includes 29 Ewing's sarcoma (EWS) samples, 18 neuroblastoma (NB) samples, 11 Burkitt's lymphoma (BL) samples, and 25 rhabdomyosarcoma (RMS) samples (Khan et al., 2001). The second multi-class dataset was the lymphoma dataset, which covers the three most prevalent adult lymphoid malignancies. It contains 62 samples of 4026 genes spanning three classes: 42 Diffuse Large B-Cell Lymphoma (DLBCL) samples, 9 Follicular Lymphoma (FL) samples, and 11 B-cell Chronic Lymphocytic Leukemia (B-CLL) samples (Alizadeh et al., 2000). The last dataset was obtained from leukemia cancer patients with three different types of leukemia: acute myeloid leukemia (AML), acute lymphoblastic leukemia (ALL), and mixed-lineage leukemia (MLL). The complete dataset contains 28 AML, 24 ALL, and 20 MLL samples (Armstrong et al., 2001). Table 2 shows the control parameters for the GBC algorithm that were used in our experiments. The first control parameter is the bee colony size (population), with a value of 80. The second control parameter is the maximum cycle, which equals the maximum number of generations; a value of 100 is used for this parameter. Another control parameter is the number of runs, which was used as a stopping criterion; we used a value of 30 in our experiments, which has been shown to be acceptable. The fourth control parameter is the limit, which represents the maximum number of iterations allowed while a food source is not improved (exhausted). If a food source exceeds this limit, it will be selected by the scout bee. A value of 5 iterations is used for this parameter.
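The genetic operators whose rates (CPR and MPR) appear in Table 2 can be sketched as follows. This is a minimal illustration of ours, assuming a binary-mask encoding of gene subsets (1 = gene selected), which is our assumption for the sketch rather than the paper's stated representation:

```python
import random

def uniform_crossover(parent1, parent2, cpr):
    """Uniform crossover: swap each position between the two parents
    independently with probability CPR (Crossover Probability Rate)."""
    child1, child2 = list(parent1), list(parent2)
    for i in range(len(child1)):
        if random.random() < cpr:
            child1[i], child2[i] = child2[i], child1[i]
    return child1, child2

def bit_flip_mutation(individual, mpr):
    """Flip each bit independently with probability MPR
    (Mutation Probability Rate)."""
    return [1 - g if random.random() < mpr else g for g in individual]
```

Because uniform crossover only exchanges positions, a position that is 0 in one parent and 1 in the other always ends up contributing exactly one selected gene across the two children, whichever way the coin flip goes.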
The last two control parameters are the Crossover Probability Rate (CPR) and the Mutation Probability Rate (MPR), the genetic control parameters of our proposed algorithm; they are not tuned further, since the algorithm is not very sensitive to changes in them. A value of 0.6 is used for CPR and 0.01 for MPR, values that were shown to be acceptable. In this study, we tested the performance of the proposed GBC algorithm by comparing it with other standard bio-inspired algorithms, including ABC, GA, and PSO. We compared the performance of each gene selection approach on two criteria: the classification accuracy and the number of predictive genes used for cancer classification. Classification accuracy is the overall correctness of the classifier and is calculated as the number of correct cancer classifications divided by the total number

Table 1
Statistics of microarray cancer datasets.

Microarray dataset  Number of classes  Number of samples  Number of genes  Reference
Colon               2                  62                 2000             Alon et al. (1999)
Leukemia1           2                  72                 7129             Golub et al. (1999)
Lung                2                  96                 7129             Beer et al. (2002)
SRBCT               4                  83                 2308             Khan et al. (2001)
Lymphoma            3                  62                 4026             Alizadeh et al. (2000)
Leukemia2           3                  72                 7129             Armstrong et al. (2001)


H.M. Alshamlan et al. / Computational Biology and Chemistry 56 (2015) 49–60

Table 3
The classification accuracy performance of the mRMR method with an SVM classifier for all microarray datasets.

Number of genes  Colon    Leukemia1  Lung     SRBCT    Lymphoma  Leukemia2
50               91.94%   91.66%     89.56%   62.65%   93.93%    77.77%
100              93.55%   97.22%     95.83%   91.44%   98.48%    86.11%
150              95.16%   100%       98.95%   96.39%   100%      95.83%
200              96.77%   100%       100%     97.59%   100%      98.61%
250              98.38%   100%       100%     100%     100%      100%
300              98.38%   100%       100%     100%     100%      100%
350              100%     100%       100%     100%     100%      100%
400              100%     100%       100%     100%     100%      100%
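The mRMR filter whose accuracies Table 3 reports ranks genes by relevance to the class labels minus redundancy with already-selected genes (Peng et al., 2005). A simplified greedy sketch in plain Python, assuming discretised expression values; this is an illustration of the idea, not the implementation used in the paper:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information between two discrete sequences (natural log)."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def mrmr_select(genes, labels, k):
    """Greedy mRMR: at each step add the gene maximising relevance
    (MI with the class labels) minus mean redundancy (mean MI with
    the genes already selected). `genes` maps gene name -> discretised
    expression values."""
    selected = []
    candidates = list(genes)  # preserves dict insertion order
    while candidates and len(selected) < k:
        def score(g):
            relevance = mutual_information(genes[g], labels)
            if not selected:
                return relevance
            redundancy = sum(mutual_information(genes[g], genes[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

In the paper's pipeline, the top-ranked genes from this filter step become the search space for the GBC wrapper stage.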

of classifications. It is computed by the expression shown below:

Classification Accuracy = (CC / N) × 100    (8)

where N is the total number of instances in the initial microarray dataset and CC refers to the number of correctly classified instances. Owing to the efficiency of the SVM algorithm in classifying high-dimensional datasets, we adopt it as the classification technique used to measure the classification accuracy of the selected genes. From the early stages of the SVM, researchers have used the linear, polynomial, and RBF kernels for classification problems (Nahar et al., 2007). Among these, the polynomial and RBF kernels are nonlinear, and cancer classification using microarray datasets is a nonlinear classification task (Nahar et al., 2007). Nahar et al. (2007) observed, from their experiments on nine microarray datasets, that the polynomial kernel is the first choice for microarray classification. Therefore, we used the polynomial kernel for the SVM classifier, with a value of 1 for the complexity constant parameter C and for the random seed parameter W. In addition, the tolerance parameter L is used for checking the stopping criterion; a value of 0.0010 is used for this parameter. A value of 1.0E−12 is used for the round-off error, named the epsilon parameter P. For the cache size parameter, we used a value of 250,007. In addition, we apply leave-one-out cross-validation (LOOCV) (Ng, 1997) in order to evaluate the performance of our proposed algorithm and of the existing methods in the literature. LOOCV is very well suited to our problem because it has the ability to prevent the "overfitting" problem (Ng, 1997). It also provides an unbiased estimate of the generalization error for stable classifiers such as the SVM classifier. In LOOCV, a single observation from the original sample is used as the testing data, and the remaining observations are used as the training data. This is repeated such that each

observation in the sample is used once as the testing data. We implemented the GA, the PSO algorithm, and the SVM using the Waikato Environment for Knowledge Analysis (WEKA version 3.6.10), an open-source data mining tool (N. Z. University of Waikato, 2012). Furthermore, in order to make the experiments more statistically valid, we conducted each experiment 30 times on each dataset. In addition, the best, worst, and average classification accuracies of the 30 independent runs were calculated in order to evaluate the performance of our proposed algorithm.

3.2. Experimental results

In this section, we present and analyze the results obtained by our algorithm. As a first step, we employed the mRMR method to identify the top relevant genes that give 100% accuracy with an SVM classifier. From Table 3 and Fig. 6, we can see that the top 150 genes in the Leukemia1 dataset generate 100% classification accuracy, while in the Colon dataset we can get 100% accuracy using 350 genes. For the Lung dataset, we achieved 100% accuracy using 200 genes, and 250 genes give the same classification accuracy for the SRBCT dataset. In addition, using the 150 most relevant genes from the Lymphoma dataset and 250 genes from the Leukemia2 dataset, we achieved 100% classification accuracy. We then used these highly relevant genes as input to the GBC algorithm to determine the most predictive and informative genes. We compared the performance of the proposed GBC algorithm with the mRMR-ABC algorithm and the original ABC algorithm results reported in our previous research (Alshamlan et al., in press), using SVM as a classifier with the same number of selected genes for all six benchmark microarray datasets. The comparison results for the binary-class microarray datasets (Colon, Leukemia1, and Lung) are shown in Tables 4–6, respectively, while Tables 7–9,

Fig. 6. The classification accuracy performance of the mRMR method with an SVM classifier for all microarray datasets.
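The evaluation protocol described above (Eq. (8) combined with LOOCV) can be sketched in plain Python. The paper uses WEKA's SVM with a polynomial kernel; here a 1-nearest-neighbour stand-in keeps the sketch dependency-free, and the pluggable `classify` argument marks where the SVM would go:

```python
def loocv_accuracy(samples, labels, classify):
    """Leave-one-out cross-validation: hold out each sample once,
    train on the rest, and count correct predictions. Returns the
    classification accuracy of Eq. (8): (CC / N) * 100, where CC is
    the number of correctly classified instances and N the total."""
    correct = 0
    for i in range(len(samples)):
        train_X = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classify(train_X, train_y, samples[i]) == labels[i]:
            correct += 1  # CC: correctly classified instances
    return 100.0 * correct / len(samples)

def nearest_neighbour(train_X, train_y, x):
    """1-NN stand-in classifier (squared Euclidean distance); the
    paper's actual classifier is a polynomial-kernel SVM."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[dists.index(min(dists))]
```

Because every observation is held out exactly once, N model fits are performed, which is why the preceding mRMR filtering step matters for keeping the wrapper search tractable.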


Table 4
The performance of the proposed GBC algorithm as compared to the mRMR-ABC and ABC algorithms when applied with the SVM classifier for the Colon dataset.

Number of genes | GBC: Best / Mean / Worst    | mRMR-ABC: Best / Mean / Worst | ABC: Best / Mean / Worst
3               | 90.32% / 88.81% / 87.10%    | 88.71% / 87.50% / 85.48%      | 87.10% / 85.91% / 83.87%
4               | 93.55% / 90.16% / 88.71%    | 90.23% / 88.27% / 87.10%      | 87.10% / 86.71% / 85.48%
5               | 95.16% / 91.51% / 88.71%    | 91.94% / 89.50% / 87.10%      | 90.32% / 87.98% / 85.48%
6               | 95.16% / 91.93% / 90.32%    | 91.94% / 90.12% / 87.10%      | 90.32% / 88.44% / 85.48%
7               | 95.16% / 92.26% / 90.32%    | 93.55% / 91.64% / 88.81%      | 91.94% / 90.20% / 88.81%
8               | 95.16% / 92.26% / 90.32%    | 93.55% / 91.80% / 88.81%      | 91.94% / 90.61% / 88.81%
9               | 96.77% / 93.06% / 90.32%    | 93.55% / 92.11% / 90.16%      | 91.94% / 90.95% / 88.81%
10              | 98.38% / 93.92% / 91.93%    | 93.55% / 92.74% / 90.16%      | 93.55% / 91.31% / 88.81%
15              | 98.38% / 94.62% / 91.93%    | 96.77% / 93.60% / 91.93%      | 93.55% / 91.38% / 90.32%
20              | 98.38% / 95.64% / 91.93%    | 96.77% / 94.17% / 91.93%      | 95.16% / 92.44% / 90.32%

Table 5
The performance of the proposed GBC algorithm as compared to the mRMR-ABC and ABC algorithms when applied with the SVM classifier for the Leukemia1 dataset.

Number of genes | GBC: Best / Mean / Worst    | mRMR-ABC: Best / Mean / Worst | ABC: Best / Mean / Worst
2               | 93.05% / 91.29% / 80.55%    | 91.66% / 89.63% / 81.94%      | 88.88% / 86.45% / 81.94%
3               | 95.83% / 94.07% / 91.66%    | 93.05% / 90.37% / 83.33%      | 90.27% / 89.82% / 83.33%
4               | 100% / 95.09% / 91.66%      | 94.44% / 91.29% / 86.11%      | 93.05% / 91.15% / 88.88%
5               | 100% / 96.43% / 93.05%      | 95.83% / 92.82% / 88.88%      | 93.05% / 91.89% / 88.88%

Table 6
The performance of the proposed GBC algorithm as compared to the mRMR-ABC and ABC algorithms when applied with the SVM classifier for the Lung dataset.

Number of genes | GBC: Best / Mean / Worst    | mRMR-ABC: Best / Mean / Worst | ABC: Best / Mean / Worst
2               | 97.91% / 96.87% / 95.83%    | 96.87% / 95.83% / 93.75%      | 88.54% / 87.5% / 84.37%
3               | 98.95% / 98.22% / 97.91%    | 97.91% / 96.31% / 93.75%      | 89.58% / 88.54% / 84.37%
4               | 100% / 98.95% / 97.91%      | 98.95% / 97.91% / 96.87%      | 91.66% / 89.58% / 87.5%
5               | 100% / 98.95% / 97.91%      | 98.95% / 97.98% / 96.87%      | 92.70% / 90.03% / 88.54%
6               | 100% / 98.95% / 97.91%      | 98.95% / 98.27% / 96.87%      | 94.79% / 91.66% / 88.54%
7               | 100% / 99.11% / 98.95%      | 98.95% / 98.53% / 96.87%      | 95.83% / 92.18% / 89.58%
8               | 100% / 99.50% / 98.95%      | 100% / 98.95% / 96.87%        | 97.91% / 93.75% / 91.66%

Table 7
The performance of the proposed GBC algorithm as compared to the mRMR-ABC and ABC algorithms when applied with the SVM classifier for the SRBCT dataset.

Number of genes | GBC: Best / Mean / Worst    | mRMR-ABC: Best / Mean / Worst | ABC: Best / Mean / Worst
2               | 77.11% / 75.90% / 72.82%    | 75.90% / 71.08% / 68.67%      | 72.28% / 69.87% / 67.46%
3               | 90.36% / 86.74% / 81.92%    | 85.54% / 79.51% / 71.08%      | 73.34% / 71.08% / 68.67%
4               | 95.18% / 92.77% / 87.75%    | 87.95% / 84.33% / 77.10%      | 84.33% / 81.92% / 77.10%
5               | 98.79% / 95.18% / 92.77%    | 91.56% / 86.74% / 84.33%      | 87.95% / 84.33% / 77.10%
6               | 100% / 96.38% / 95.18%      | 95.36% / 91.56% / 87.99%      | 92.77% / 87.99% / 84.33%

Table 8
The performance of the proposed GBC algorithm as compared to the mRMR-ABC and ABC algorithms when applied with the SVM classifier for the Lymphoma dataset.

Number of genes | GBC: Best / Mean / Worst    | mRMR-ABC: Best / Mean / Worst | ABC: Best / Mean / Worst
2               | 86.36% / 86.36% / 86.36%    | 86.36% / 86.36% / 86.36%      | 86.36% / 86.36% / 86.36%
3               | 96.96% / 92.42% / 87.87%    | 93.93% / 90.90% / 86.36%      | 89.39% / 87.87% / 86.36%
4               | 100% / 95.45% / 93.93%      | 96.96% / 92.42% / 89.39%      | 93.93% / 89.39% / 86.36%
5               | 100% / 98.48% / 95.45%      | 100% / 96.96% / 93.93%        | 96.96% / 92.42% / 90.90%


Table 9
The performance of the proposed GBC algorithm as compared to the mRMR-ABC and ABC algorithms when applied with the SVM classifier for the Leukemia2 dataset.

Number of genes | GBC: Best / Mean / Worst    | mRMR-ABC: Best / Mean / Worst | ABC: Best / Mean / Worst
2               | 84.72% / 84.72% / 84.72%    | 84.72% / 84.72% / 84.72%      | 84.72% / 84.72% / 84.72%
3               | 90.27% / 87.5% / 84.72%     | 87.5% / 86.11% / 84.72%       | 86.11% / 85.23% / 84.72%
4               | 93.05% / 90.27% / 86.11%    | 90.27% / 87.5% / 84.72%       | 87.5% / 86.11% / 84.72%
5               | 95.83% / 91.66% / 87.5%     | 90.27% / 88.88% / 86.11%      | 87.5% / 86.45% / 84.72%
6               | 98.61% / 93.05% / 91.66%    | 94.44% / 90.27% / 87.5%       | 90.27% / 88.88% / 86.11%
7               | 98.61% / 94.44% / 91.66%    | 93.05% / 89.49% / 88.88%      | 90.27% / 89.22% / 86.11%
8               | 100% / 95.83% / 90.27%      | 94.44% / 91.66% / 87.5%       | 91.66% / 90.27% / 88.88%

Table 10
The classification accuracy of the existing gene selection algorithms under comparison when combined with the SVM as a classifier for six microarray datasets. Numbers in parentheses denote the numbers of selected genes; a dash indicates that no result is available.

Algorithm                              | Colon        | Leukemia1    | Lung       | SRBCT      | Lymphoma   | Leukemia2
GBC                                    | 98.38 (10)   | 100 (4)      | 100 (4)    | 100 (6)    | 100 (4)    | 100 (8)
mRMR-ABC (Alshamlan et al., in press)  | 96.77 (15)   | 100 (14)     | 100 (8)    | 95.36 (6)  | 100 (5)    | 94.44 (8)
ABC                                    | 95.16 (20)   | 95.83 (20)   | 97.91 (8)  | 92.77 (6)  | 96.96 (5)  | 91.66 (8)
mRMR-GA                                | 95.61 (83)   | 93.05 (51)   | 95.83 (62) | 92.77 (74) | 93.93 (43) | 94.44 (57)
mRMR-PSO                               | 93.55 (78)   | 95.83 (53)   | 94.79 (65) | 93.97 (68) | 96.96 (82) | 95.83 (61)
PSO (Qi et al., 2007)                  | 85.48 (20)   | 94.44 (23)   | –          | –          | –          | –
PSO (Javad and Giveki, 2013)           | 87.01 (2000) | 93.06 (7129) | –          | –          | –          | –
mRMR-PSO (Javad et al., 2012)          | 90.32 (10)   | 100 (18)     | –          | –          | –          | –
GADP (Lee and Leu, 2011)               | –            | –            | –          | 100 (8)    | –          | –
mRMR-GA (Amine et al., 2009)           | –            | –            | 100 (15)   | –          | –          | –
ESVM (Huang and Chang, 2007)           | –            | –            | –          | –          | –          | –
MLHD-GA (Huang et al., 2007)           | –            | –            | –          | 100 (11)   | –          | 100 (9)
CFS-IBPSO (Yang et al., 2008)          | –            | –            | –          | –          | –          | –
GA (Peng et al., 2003)                 | –            | 100          | –          | –          | –          | –
mAnt (Yu et al., 2009)                 | 91.5 (8)     | 100          | –          | –          | –          | –

respectively, present the comparison results for the multi-class microarray datasets: SRBCT, Lymphoma, and Leukemia2. From these tables, it is clear that our proposed GBC algorithm performs better than both the mRMR-ABC algorithm and the original ABC algorithm in every single case (i.e., on all datasets using different numbers of selected genes). In this research, we re-implemented mRMR with Particle Swarm Optimization (mRMR-PSO) and mRMR with a Genetic Algorithm (mRMR-GA) in order to compare their performance with that of the GBC algorithm under the same parameters. In addition, we compared it with the original ABC and the mRMR-ABC algorithm (Alshamlan et al., in press), as well as with published results for recent gene selection algorithms. Notably, all these algorithms have been combined with the SVM as a classification approach. Table 10 shows the experimental results of the GBC algorithm and other existing methods. Compared with the GBC algorithm, the mAnt method proposed by Yu et al. (2009) selected fewer genes on the Colon dataset: it selected 8 genes and achieved 91.5% classification accuracy, whereas the GBC algorithm selects 10 genes and achieves 98.38% classification accuracy. For the Leukemia1 dataset, the GBC algorithm achieves 100% classification accuracy with 4 selected genes. In comparison, our previous algorithm mRMR-ABC (Alshamlan et al., in press), Javad et al. (2012), Peng et al. (2003), and Yu et al. (2009) also achieved 100% classification accuracy; however, they selected more genes. For the Lung dataset, the GBC algorithm selected 4 genes to achieve 100% classification accuracy, whereas the mRMR-ABC algorithm (Alshamlan et al., in press) selected 8 genes and the mRMR-GA algorithm proposed by Amine et al. (2009) selected 15 genes to achieve 100% accuracy on the same dataset. For the SRBCT dataset, the GADP algorithm proposed by Lee and Leu (2011) and the MLHD-GA algorithm proposed by Huang et al. (2007) achieved 100% classification accuracy.
The GADP algorithm selected 8 genes and the MLHD-GA algorithm selected 11 genes; by contrast, the GBC algorithm selects only 6 genes and achieves


100% classification accuracy. Although many existing algorithms achieve 100% accuracy on the Lymphoma dataset, the GBC algorithm selects a smaller number of predictive genes: only 4 genes are needed to achieve 100% classification accuracy. Finally, for the Leukemia2 dataset, the GBC method selected 8 genes to achieve 100% classification accuracy, whereas the MLHD-GA algorithm proposed by Huang et al. (2007) selected 9 genes to achieve the same accuracy. The best predictive and most frequently selected genes that give the highest classification accuracy for each microarray dataset using the GBC algorithm are reported in Table 11. To summarize, for all binary and multi-class microarray datasets, the existing methods tend to select few genes while reaching high classification accuracy; the GBC algorithm, however, selects fewer genes than these methods while maintaining comparable or higher accuracy, and the methods that do select fewer genes than the GBC algorithm achieve lower classification accuracy than it does. Moreover, the proposed GBC algorithm achieves the highest classification accuracy and the lowest average number of selected genes when tested on all datasets, as

Table 11
The best predictive genes that give the highest classification accuracy for each microarray dataset using the GBC algorithm.

Dataset    | Predictive genes                                                                                  | Accuracy
Colon      | Gene1771, Gene1548, Gene85, Gene923, Gene14, Gene164, Gene2, Gene175, Gene1531, Gene99            | 98.39%
Leukemia1  | M31523_at, X62320_at, X66401_cds1_at, M92287_at                                                   | 100%
Lung       | X67325_at, X89067_at, U89336_cds3_at, M19722_at                                                   | 100%
SRBCT      | Gene2308, Gene1, Gene545, Gene1662, Gene1087, Gene1636                                            | 100%
Lymphoma   | Gene303X, Gene1219X, Gene2399X, Gene2015X                                                         | 100%
Leukemia2  | X14767_at, D49950_at, X70944_s_at, X00274_at, X96752_at, U94855_at, U89922_s_at, M31523_at        | 100%


compared to the original ABC algorithm and our previously proposed mRMR-ABC algorithm (Alshamlan et al., in press) under the same cross-validation approach. Therefore, we can conclude that GBC is a promising approach for solving gene selection and cancer classification problems.

4. Conclusion

In this research paper, we proposed a new Artificial Bee Colony-based hybrid gene selection approach, called GBC, to be combined with an SVM classifier. It can be used to solve classification problems that deal with high-dimensional datasets, especially microarray gene expression profiles. In our proposed algorithm, we adopted four modifications of the original ABC algorithm in order to combine the exploration and exploitation capabilities of the GA and the ABC algorithm. First, we preprocessed the microarray dataset using the mRMR filter method to reduce the computational time and cost of the ABC algorithm. Then, in order to benefit from the information sharing about the positions of food sources (solutions) given by the employed bees, we improved the ABC algorithm by adding a uniform crossover operation as a new step in the onlooker phase. Subsequently, we increased the number of scout bees to two, rather than the single scout bee of the basic ABC algorithm, to improve the movement speed. Finally, in order to achieve a balance between the exploitation and exploration capabilities of the ABC algorithm and to improve its local search and exploitation abilities, we adopted mutation operators from the GA during the replacement of exhausted solutions in the scout bee phase. Extensive experiments were conducted using six binary and multi-class microarray datasets. The results showed that the proposed GBC algorithm outperforms previously reported results. In the future, we plan to run experiments on more real-world and benchmark datasets to verify and extend the proposed algorithm.
In addition, the GBC algorithm can be considered a general framework that can be used to solve various optimization problems.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Hala Alshamlan conceived and designed the study, implemented and tested the algorithm, analyzed the results, drafted the manuscript, and critically revised the final manuscript. Ghada Badr designed and participated in the coordination of the study, helped draft the manuscript, and critically revised the manuscript. Yousef Alohali participated in the coordination of the study. All authors participated in the analysis and interpretation of the results. All authors read and approved the final manuscript.

Acknowledgements

This research project was supported by a grant from the Research Center of the Center for Female Scientific and Medical Colleges, Deanship of Scientific Research, King Saud University.

References

Alba, E., Garcia-Nieto, J., Jourdan, L., Talbi, E.-G., 2007. Gene selection in cancer classification using PSO/SVM and GA/SVM hybrid algorithms. In: IEEE Congress on Evolutionary Computation, 2007 (CEC 2007), pp. 284–290.
Alizadeh, A., Eisen, M., Davis, M., Rosenwald, A., Boldrick, J., Sabet, T., Powell, Y., Yang, L., Marti, G., Moore, T., Hudson, J., Lu, L., Lewis, D., Tibshirani, R., Sherlock, G., Chan, W., Greiner, T., Weisenburger, D., Armitage, J., Warnke, R., Levy, R., Wilson, W., Grever, M., Byrd, J., Botstein, D., Brown, P., Staudt, L., 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403 (6769), 503–511.


Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., Levine, A., 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. U. S. A. 96 (12), 6745–6750.
Alshamlan, H.M., Badr, G.H., Alohali, Y.A., 2014. The performance of bio-inspired evolutionary gene selection methods for cancer classification using microarray dataset. Int. J. Biosci. Biochem. Bioinform. 4 (3), 166–170.
Alshamlan, H., Badr, G., Alohali, Y., 2014. A comparative study of cancer classification methods using microarray gene expression profile. In: DaEng, vol. 285 of Lecture Notes in Electrical Engineering. Springer, pp. 389–398.
Alshamlan, H., Badr, G.H., Al-Ohali, Y., 2015. mRMR-ABC: a hybrid gene selection algorithm for microarray cancer classification. BioMed Res. Int. J. (in press) http://www.hindawi.com/journals/bmri/aip/604910/
Amine, A., El Akadi, A., El Ouardighi, A., Aboutajdine, D., 2009. A new gene selection approach based on minimum redundancy-maximum relevance (mRMR) and Genetic Algorithm (GA). In: IEEE/ACS International Conference on Computer Systems and Applications, 2009 (AICCSA 2009), pp. 69–75.
Armstrong, S.A., Staunton, J.E., Silverman, L.B., Pieters, R., den Boer, M.L., Minden, M.D., Sallan, S.E., Lander, E.S., Golub, T.R., Korsmeyer, S.J., 2001. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat. Genet. 30 (1), 41–47.
Beer, D.G., Kardia, S.L., Huang, C.-C., Giordano, T.J., Levin, A.M., Misek, D.E., Lin, L., Chen, G., Gharib, T.G., Thomas, D.G., et al., 2002. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med. 8 (8), 816–824.
Eiben, A.E., Smith, J.E., 2003. Introduction to Evolutionary Computing. Springer.
Ghorai, S., Mukherjee, A., Sengupta, S., Dutta, P., 2010. Multicategory cancer classification from gene expression data by multiclass NPPC ensemble. In: 2010 International Conference on Systems in Medicine and Biology (ICSMB), pp. 4–48.
Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, L., Downing, J., Caligiuri, M., Bloomfield, C., Lander, E., 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286 (5439), 531–537.
Huang, H.-L., Chang, F.-L., 2007. ESVM: evolutionary support vector machine for automatic feature selection and classification of microarray data. Biosystems 90 (2), 516–528.
Huang, H.-L., Lee, C.-C., Ho, S.-Y., 2007. Selecting a minimal number of relevant genes from microarray data to design accurate tissue classifiers. Biosystems 90 (1), 78–86.
Jatoth, R.K., Rajasekhar, A., 2010. Speed control of PMSM by hybrid genetic Artificial Bee Colony algorithm. In: 2010 IEEE International Conference on Communication Control and Computing Technologies (ICCCCT), pp. 241–246.
Javad, A.M., Giveki, D., 2013. Automatic detection of erythemato-squamous diseases using PSO-SVM based on association rules. Eng. Appl. Artif. Intell. 26 (1), 603–608.
Javad, A.M., Mohammad, H.S., Rezghi, M., 2012. A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification. Comput. Math. Methods Med.
Kıran, M.S., Gündüz, M., 2012. A novel Artificial Bee Colony-based algorithm for solving the numerical optimization problems. Int. J. Innov. Comput. Inf. Control 8 (9), 6107–6121.
Karaboga, D., Akay, B., 2009. A comparative study of Artificial Bee Colony algorithm. Appl. Math. Comput. 214 (1), 108–132.
Karaboga, D., Basturk, B., 2007. A powerful and efficient algorithm for numerical function optimization: Artificial Bee Colony (ABC) algorithm. J. Glob. Optim. 39 (3), 459–471.
Karaboga, D., Basturk, B., 2008. On the performance of Artificial Bee Colony (ABC) algorithm. Appl. Soft Comput. 8 (1), 687–697, http://dx.doi.org/10.1016/j.asoc.2007.05.007.
Karaboga, D., 2005. An idea based on honey bee swarm for numerical optimization. Technical Report, Erciyes University, Engineering Faculty, Computer Engineering Department.
Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., et al., 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7 (6), 673–679.
Lee, C.-P., Leu, Y., 2011. A novel hybrid feature selection method for microarray data analysis. Appl. Soft Comput. 11 (1), 208–213, http://dx.doi.org/10.1016/j.asoc.2009.11.010.
Milan, T., 2013. Artificial Bee Colony (ABC) algorithm with crossover and mutation. Appl. Soft Comput., 687–697.
N. Z. University of Waikato, Waikato environment for knowledge analysis, http://www.cs.waikato.ac.nz/ml/weka/downloading.html (accessed 06.12.14).
Nahar, J., Ali, S., Chen, Y.-P.P., 2007. Microarray data classification using automatic SVM kernel selection. DNA Cell Biol. 26 (10), 707–712.
Narendra, P.M., Fukunaga, K., 1977. A branch and bound algorithm for feature subset selection. IEEE Trans. Comput. 26 (9), 917–922, http://dx.doi.org/10.1109/TC.1977.1674939.
Nebojsa, B., Milan, T., 2012. Artificial Bee Colony (ABC) algorithm for constrained optimization improved with genetic operators. Stud. Inform. Control 21 (2), 137–146.
Ng, A.Y., 1997. Preventing "overfitting" of cross-validation data. ICML, vol. 97., pp. 245–253.
Peng, S., Xu, Q., Ling, X.B., Peng, X., Du, W., Chen, L., 2003. Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines. FEBS Lett. 555 (2), 358–362.



Peng, H., Long, F., Ding, C., 2005. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pat. Anal. Mach. Intell. 27 (8), 1226–1238.
Qi, S., Shi, W.-M., Wei, K., Ye, B.-X., 2007. A combination of modified particle swarm optimization algorithm and support vector machine for gene selection and tumor classification. Adv. Comput. Sci. 71 (4), 157–162.
Sheng-Bo, G., Michael, L., Ming, L., 2006. Gene selection based on mutual information for the classification of multi-class cancer. In: Proceedings of the 2006 International Conference on Computational Intelligence and Bioinformatics – vol. Part III (ICIC'06). Springer-Verlag, pp. 454–463.
Simon, R., 2009. Analysis of DNA microarray expression data. Best Pract. Res. Clin. Haematol. 22 (2), 271–282.

Wang, X., Gotoh, O., 2009. Microarray-based cancer prediction using soft computing approach. Cancer Inform. 7, 123–139.
Xiang, W.-l., An, M.-q., 2013. An efficient and robust Artificial Bee Colony algorithm for numerical optimization. Comput. Oper. Res. 40 (5), 1256–1265.
Yang, C.-S., Chuang, L.-Y., Ke, C.-H., Yang, C.-H., 2008. A hybrid feature selection method for microarray classification. Int. J. Comput. Sci. 35, 285–290.
Yu, H., Gu, G., Liu, H., Shen, J., Zhao, J., 2009. A modified ant colony optimization algorithm for tumor marker gene selection. Genom. Proteom. Bioinform. 7 (4), 200–208.
Zhu, G., Kwong, S., 2010. Gbest guided Artificial Bee Colony algorithm for numerical function optimization. Appl. Math. Comput. 217 (7), 3166–3173.
