Computers in Biology and Medicine 56 (2015) 82–88


Computerized system for recognition of autism on the basis of gene expression microarray data

Tomasz Latkowski a, Stanislaw Osowski a,b,*

a Military University of Technology, Institute of Electronic Systems, Warsaw, Kaliskiego 2, Poland
b Warsaw University of Technology, Institute of the Theory of Electrical Engineering, Measurement and Information Systems, Warsaw, Koszykowa 75, Poland

* Corresponding author. Tel.: +48 22 234 7235. E-mail addresses: [email protected] (T. Latkowski), [email protected] (S. Osowski).

http://dx.doi.org/10.1016/j.compbiomed.2014.11.004

Article history: Received 13 June 2014; Accepted 2 November 2014

Keywords: Gene expression microarray; Gene selection; Ensemble of classifiers; SVM; Random forest

Abstract

The aim of this paper is to provide a means to recognize a case of autism using gene expression microarrays. The crucial task is to discover the most important genes which are strictly associated with autism. The paper presents an application of different methods of gene selection to select the most representative input attributes for an ensemble of classifiers. The set of classifiers is responsible for distinguishing autism data from the reference class. Simultaneous application of a few gene selection methods enables analysis of the ill-conditioned gene expression matrix from different points of view. The results of selection combined with a genetic algorithm and SVM classifier have shown increased accuracy of autism recognition. Early recognition of autism is extremely important for treatment of children and increases the probability of their recovery and return to normal social communication. The results of this research can find practical application in early recognition of autism on the basis of gene expression microarray analysis.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Autism spectrum disorders (ASD) are pervasive neurodevelopmental disorders that affect different aspects of human functioning [1]. Many studies [2–4] aimed at biomarker identification for ASD have focused on genetic variants. An important research direction is studying gene expression microarrays in search of gene expression signatures informative for the identification of ASD. Gene microarray technology is a technique for simultaneously detecting alterations in the expression of thousands of genes between different biological conditions [5]. Analysis of the expression levels makes it possible to detect altered expression of particular genes in the considered disease compared to healthy controls. The most relevant genes, strictly associated with the mechanism of disease formation, make it possible to predict the potential danger of being affected by the disease.

The most important problem in this analysis is the small number of observations (usually in the range of hundreds) relative to a very large number of gene expressions, usually tens of thousands. This considerable imbalance between the number of genes and the number of observations makes the selection an ill-conditioned problem. Moreover, the data stored in medical databases related to autism are typically noisy and some gene sequences have large variance [6].

Recent progress in data mining techniques, which started from the pioneering work of the Golub team [7], has laid solid foundations for discovering the genes that are best associated with a particular disease. Current approaches to the task of gene selection include different clustering methods [8], application of neural networks and Support Vector Machines [9,10], statistical tests [11], linear regression methods applying forward and backward selection [12,13], fuzzy logic based algorithms [14], rough set theory [15], various statistical methods [8,16], as well as combinations of many selection methods [10,17].

The progress in this field depends on the type of investigated illness. ASDs are among the most difficult cases because of the large variation of gene expression levels among individuals [3,6,18]. A recent approach presented in [3] is based on the analysis of data belonging to phenotypic subgroups. However, the reported accuracy of recognition between the reference class and the combined autistic group is still only 81.8% [3]. This rate is not satisfactory from the practical point of view.

This paper proposes a different approach to the problem. We apply many gene validation and selection methods cooperating with support vector machine (SVM) classifiers. They form a so-called ensemble of classifiers, integrated into the final system by a random forest of decision trees [19]. Applying different techniques of gene selection allows the selection problem to be viewed from different points of view. After fusing their results into a single outcome, the probability of proper recognition of classes is increased. We will demonstrate that our approach is able to significantly increase the accuracy of autism recognition.

The most important contribution of the paper is the development of a system that fuses the results of many selection/classification approaches into a final decision of increased accuracy. This is in contrast to the majority of papers, where different methods have been tried but only one (the best) was used as the final solution. The results of numerical experiments performed on the NCBI autism database [20] have confirmed the superiority of the proposed approach.

2. Materials

The numerical experiments in autism recognition have been performed on a publicly available database downloaded from the GEO (NCBI) repository [20]. The number of observations in this dataset equals 146 and the number of genes is 54,613. The database consists of two classes. The first is related to children with autism (n = 82 instances). The second (control) group is composed of healthy children (n = 64 instances). All subjects in the database are male. Probands and controls were all recruited from the Phoenix area. Blood draws for all subjects were done between the spring and summer of 2004. Total RNA was extracted for microarray experiments with Affymetrix Human U133 Plus 2.0 3′ Expression Arrays.

Children with autism were diagnosed by a medical professional (developmental pediatrician, psychologist or child psychiatrist) according to the DSM-IV criteria, and the diagnosis was confirmed on the basis of the ADOS and ADI-R criteria [21]. In an attempt to obtain a homogeneous population of children with autism, non-classic forms of autism were excluded, including autism with regression and Asperger's syndrome, a higher-functioning form of autism in which individuals have language skills within the normal range. In addition, each subject had a normal high-resolution chromosome analysis and a negative Fragile X DNA test. IQ scores of autistic and control children were not collected for this study, but all autistic individuals did demonstrate a language impairment as part of the diagnostic criteria. For additional analysis, paternal ages were available for 78 children with autism and 57 controls. The group of children with autism was significantly younger than the control group (autism: mean 5.5 years, SD 2.1; control: mean 7.9 years, SD 2.2). Paternal age was similar between the groups. The study population was primarily Caucasian and there were no group-level differences in ethnicity. The box plots of the ages of the children and their parents are presented in Fig. 1.

Fig. 1. The box plots representing the ages of the children and their parents in the database.

Our main task is to build a computer program able to separate the autistic group from the control group with the highest accuracy on the basis of expression levels only. The class labels of the testing subjects were not made available to the program. We solved the problem in two phases. In the first, we applied a few methods of selecting small but optimal subsets of genes with good class discriminative abilities. In the next phase these genes were used as the input attributes to the SVM classifiers forming an ensemble. The final decision of the ensemble was worked out by an additional random forest classifier.
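For readers who wish to reproduce the data preparation, the matrix of 146 observations by 54,613 genes can be assembled directly from the GEO record GDS4431 [20]. The following is a minimal sketch only; the use of the GEOparse Python package and the exact name of the annotation column ("disease state") are our assumptions, as the paper does not state the tooling used.

```python
# A minimal sketch of assembling the expression matrix from GEO record GDS4431.
# Assumptions (not stated in the paper): the GEOparse package is installed and
# the sample annotation exposes a "disease state" column.
import GEOparse
import numpy as np

gds = GEOparse.get_GEO(geo="GDS4431", destdir="./data")

# Expression table: one row per probe set, one column per sample (GSM id).
table = gds.table.set_index("ID_REF")
sample_ids = [c for c in table.columns if c.startswith("GSM")]
X = table[sample_ids].to_numpy(dtype=float).T  # (146 samples, 54613 genes)

# Labels: 1 for autism, 0 for control, read from the sample annotations.
meta = gds.columns  # DataFrame indexed by GSM id
y = np.array([1 if "autism" in str(meta.loc[s, "disease state"]).lower() else 0
              for s in sample_ids])
print(X.shape, int(y.sum()), int(len(y) - y.sum()))  # expected: (146, 54613) 82 64
```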

3. Genetic approach to gene selection

To assess the class discrimination ability of genes, we treated each gene as a diagnostic feature and applied well-known feature validation methods to solve the selection problem. The following methods were applied: Fisher discriminant analysis, the ReliefF algorithm, the two-sample t-test, the Kolmogorov–Smirnov test, the Kruskal–Wallis test, the stepwise regression method, feature correlation with a class, and SVM recursive feature elimination. A short description of them is provided in the Appendix. These methods rely on different operating principles, which allows the selection problem to be seen from different points of view. As a result of applying these methods we get 8 different sets of genes ordered according to their class discrimination ability, from the most to the least discriminative. However, at this stage none of the methods indicates the optimal set size.

The important task was to find the optimal population of genes that would guarantee the best performance at the classification stage. To solve this problem we performed the following calculations using only a limited number of the most significant genes, selected at the first stage of validation. This task was done using a genetic algorithm [22,23] cooperating with a Support Vector Machine with a Gaussian kernel [24]. Only the 100 best genes were used at this stage of processing. We limited this number to optimize the performance and increase the speed of the genetic algorithm. It is known that a classifier of good generalization ability should be supplied with a limited number of input attributes (usually no more than a few tens). Therefore, the 100 best genes from each selection procedure should be enough to provide a wide choice for the genetic operations. The application of the genetic algorithm was repeated separately for all 8 sets of genes chosen in the course of the individual selection procedures.

In this solution we used a binary code representation of the genes. The value 1 means inclusion of the gene in the input vector x to the classifier, while zero indicates its absence. In all experiments an elitist strategy of passing the two fittest population members to the next generation was used. This guarantees that the fitness never declines from one generation to the next, which is a desirable property in our application. The algorithm created crossover children by combining pairs of parents from the current population selected by the roulette rule. The crossover probability applied in the solution was 0.8. Mutation children were created by randomly changing the genes of individual parents. The assumed mutation rate was 0.03.

Each chromosome in the genetic algorithm is associated with the input vector x applied to the SVM classifier. Two data sets were involved in the GA based training: the learning set and a validation set extracted from the learning one (20% of the learning data). The classifier is trained on the learning data and then tested on the validation data set. The testing error on the validation data forms the basis for the definition of the fitness function. Fitness is defined as the error function taken with a minus sign. The genetic algorithm maximizes the value of the fitness function (equivalent to minimization of the error function) by performing subsequent operations of selection of parents, crossover among parents, and finally mutation. The described process is repeated until a termination condition is reached.
The applied termination conditions are as follows: a solution is found that satisfies the minimum criteria, a fixed number of generations is reached, the allocated computation time is exhausted, the highest-ranking solution's fitness is reached, or a plateau is reached such that successive iterations no longer produce better results. In our approach the fitness function is defined directly on the basis of the 10-fold cross-validation error of the SVM classification system with the Gaussian kernel.

The genetic algorithm proved a very effective way of finding the set of most significant features, and its application to feature selection was found superior to the other methods. The GA performs two tasks simultaneously: it selects the most important features from the predefined set and at the same time determines their optimal number. This is a unique combination of capabilities, not available in the classical approaches to the selection problem.
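The GA loop described above can be summarized in code. The following is a minimal sketch under stated assumptions: the population size and generation count are illustrative values (the paper does not report them), scikit-learn stands in for the original tooling, and the fitness is taken directly as the cross-validated accuracy, which is equivalent to the negated error used in the paper.

```python
# Sketch of GA-driven gene subset selection: binary chromosomes over the 100
# top-ranked genes, roulette selection, crossover probability 0.8, mutation
# rate 0.03, elitism of 2, fitness = cross-validated SVM accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Cross-validated accuracy of a Gaussian-kernel SVM on the chosen genes."""
    if not mask.any():
        return 0.0
    return cross_val_score(SVC(kernel="rbf"), X[:, mask], y, cv=10).mean()

def select_genes(X, y, pop_size=40, generations=60):
    n = X.shape[1]  # e.g. the 100 top-ranked genes of one selection method
    pop = rng.random((pop_size, n)) < 0.5            # binary chromosomes
    for _ in range(generations):
        fit = np.array([fitness(ind, X, y) for ind in pop])
        elite = pop[np.argsort(fit)[-2:]]            # elitist strategy: keep 2 best
        probs = fit / fit.sum()                      # roulette-wheel selection
        children = []
        while len(children) < pop_size - 2:
            p1, p2 = pop[rng.choice(pop_size, size=2, p=probs)]
            if rng.random() < 0.8:                   # crossover probability 0.8
                cut = int(rng.integers(1, n))
                child = np.concatenate([p1[:cut], p2[cut:]])
            else:
                child = p1.copy()
            flip = rng.random(n) < 0.03              # mutation rate 0.03
            children.append(np.where(flip, ~child, child))
        pop = np.vstack([elite] + children)
    fit = np.array([fitness(ind, X, y) for ind in pop])
    return pop[fit.argmax()]                         # best gene mask found
```

In the experiments this procedure would be run once per selection method, each time on the 100 genes ranked highest by that method.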

4. Results of autism recognition

4.1. Applied system of class recognition

The genes chosen in the selection phase can be used for classification of the microarray data divided into two classes: class 1 (autism) and class 2 (control subjects). To get the most reliable results we applied an ensemble of classifiers composed of Support Vector Machines (SVM) with a Gaussian kernel [24], each supplied with a different set of features selected by the genetic algorithm for one of the applied selection methods. Eight different SVM classifier systems, supplied with sets of features selected by different methods, were combined into an ensemble. In the following, these abbreviations are used for the feature selection methods: FDA, the Fisher discriminant analysis; RFA, the ReliefF algorithm; TT, the two-sample t-test; KST, the Kolmogorov–Smirnov test; KWT, the Kruskal–Wallis test; SWR, the stepwise regression method; COR, the feature correlation with a class; SVM, the SVM-RFE method. The final classification result of this ensemble is created by integrating the individual outcomes using a random forest (RF) of decision trees [19]. The procedure was repeated 10 times with different contents of the learning and testing data. The general scheme of data processing is presented in Fig. 2.

Fig. 2. General scheme of the applied classification system of autism data.

To get the most objective results, the gene selection and classification stages were performed on different instances (60% of the data for gene selection and the remaining 40% for classification). The selection and classification stages of autism recognition were repeated 10 times on randomly generated splits of the data used for the selection and classification purposes.

The applied SVM is one of the best binary classifiers. It was developed by Vapnik as a linear machine working in a high-dimensional feature space created by the non-linear mapping of the N-dimensional input vector x into an L-dimensional feature space (L > N) by using the kernel function K(x, xi). The learning problem of the SVM is defined as the task of separating the learning vectors into two classes of the destination values, di = 1 (one class) or di = −1 (the opposite class), with the maximal separation margin. The SVM with the Gaussian kernel was used in our application as the most universal and efficient one. The hyperparameters (the regularization constant C and the Gaussian kernel width) were adjusted by repeating the learning experiments for a set of predefined values and choosing the best one on the validation data sets.

The Breiman random forest [19] is used as the integrating tool of the ensemble of classifiers. It constructs many decision trees at training time and outputs the class that is the mode of the classes indicated by the individual trees. The learning data for each tree are selected randomly to improve the generalization ability of the forest. Random selection of input variables is also applied in each node of the trees. We use a random forest to integrate the results of the 8 different classifiers into the final outcome. The output signals of these eight units form the input attributes for the random forest, which is responsible for generating the final recognition result.

4.2. Numerical results of classification

There are many genes in the database which have very similar or identical means in both classes. According to Fisher theory [25], such genes have no class discrimination ability. To make the selection process easier we excluded such genes from consideration in the introductory phase of the experiment. Thanks to this we were able to improve the ratio of the number of observations to the number of genes. Whole columns representing unimportant genes were removed from the data set. To make the process automatic we set a threshold on the similarity of the means in both classes. If the similarity for a particular gene was higher than the threshold value, the gene was eliminated. In setting the threshold value we aimed at a number of accepted genes of around one third of the original number. According to our experience, such elimination has a positive influence on the quality of the results. The active range of this threshold, which allowed the number of removed genes to be controlled, extended from 0.9 to 1 in the considered database. We assumed the value of 0.96; this threshold allowed the number of genes to be reduced from the original 54,613 to 17,831. All genes for which the ratio of means in both classes was higher than 0.96 were removed from the database.
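Both the mean-similarity filter and the ensemble scheme of Fig. 2 are straightforward to prototype. The sketch below is illustrative only: the ratio is implemented as min/max of the class means (the paper does not give the exact formula), scikit-learn stands in for the original tooling, and, unlike in the paper, the fusion stage is shown fitted on the same data that trained the member SVMs.

```python
# Illustrative sketch: (1) remove genes whose class means are too similar
# (ratio above 0.96), (2) fuse eight Gaussian-kernel SVMs, each fed a
# different selected gene subset, with a random-forest integrator.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def filter_similar_means(X, y, threshold=0.96):
    """Drop genes whose ratio of class means exceeds the threshold.
    The min/max form of the ratio is our assumption."""
    m1, m2 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    ratio = (np.minimum(np.abs(m1), np.abs(m2))
             / (np.maximum(np.abs(m1), np.abs(m2)) + 1e-12))
    keep = ratio <= threshold  # ratio close to 1 means nearly equal means
    return X[:, keep], np.where(keep)[0]

class SvmEnsembleRF:
    """Eight SVMs on different gene subsets, fused by a random forest."""
    def __init__(self, gene_subsets):
        self.gene_subsets = gene_subsets  # one index array per selection method
        self.svms = [SVC(kernel="rbf") for _ in gene_subsets]
        self.rf = RandomForestClassifier(n_estimators=200, random_state=0)

    def _member_outputs(self, X):
        # Column i holds the class predictions of the i-th member SVM.
        return np.column_stack([svm.predict(X[:, idx])
                                for svm, idx in zip(self.svms, self.gene_subsets)])

    def fit(self, X, y):
        for svm, idx in zip(self.svms, self.gene_subsets):
            svm.fit(X[:, idx], y)
        # The paper works out the fusion on separate data; here, for brevity,
        # the RF integrator is fitted on the training outputs.
        self.rf.fit(self._member_outputs(X), y)
        return self

    def predict(self, X):
        return self.rf.predict(self._member_outputs(X))
```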



At the next stage the eight feature selection methods were applied to discover the importance of the genes and their order. The selection procedure was repeated 10 times on a randomly selected 60% of the available observations (the remaining 40% of the data was left for the classification stage). The positions of each gene were noted in all runs of selection and then summed up. The genes were ordered on the basis of this sum, from the most to the least significant. Because the selection procedures rely on different principles, their results also differed. The redundancy rate among the best genes in different algorithms, observed for the 100 most important genes selected by different methods, ranged from 5% to 63%. The contents of the gene sets also changed due to the random choice of the instances used in selection.

The genes most important in classification were found by the genetic algorithm at the next stage. The genetic algorithm relies on a high level of randomness; its final results depend on the contents of the sets of instances chosen randomly in each run. Therefore, the genes selected in different runs differed. The following genes were among the most frequently selected: RMI1, NRIP1, TOP1, ZFHX3, CEP350, NFYA, PSENEN, ANP32A, SEMA4C, SP1.

The important aim of our experiments was to find a solution providing the highest accuracy of recognition of autistic instances. This assessment should rely on the statistical accuracy of classification estimated in the most objective way. We did this by splitting the relatively small set of observations into smaller independent parts and repeating the classification procedure many times. In this research we did not pay much attention to the contents of the selected gene sets; this might be the subject of another study.

The remaining 40% of instances were left for the classification stage. Once again this part of the data was split into learning (60%) and testing (40%) sets. The split of data into selection and classification parts was repeated 10 times. We applied a limited number of repetitions because of the very limited number of available instances in the database. It should be noted that only 40% of the observations in the database were used in the classification experiments, of which only 40% (16% of all the data) created a testing set. This is a very small population for use in 10-fold cross-validation experiments. We also tried to repeat the experiments further; however, this did not bring significantly different results, because the exchange of data between succeeding experiments was not significant.

Separating the selection data from the classification data allowed us to perform these two stages independently of each other. This means that the results of selection did not interfere with the classification process, which was performed on data not seen in the selection phase. Repeating the experiments many times with different combinations of instances allowed a more objective assessment of the efficiency of the proposed method to be drawn. Moreover, thanks to the 10-fold cross-validation procedure, the entire data set could be used in both phases of the experiments.

As a result of such an approach we got eight different outcomes of the SVM classifiers responsible for recognition of autism and reference class data. Each run of the 10-fold cross-validation procedure of the eight classifiers was followed by integration of their results into the final score. This task of integration was performed by a random forest network [19]. Fig. 3 shows the statistical importance of the applied methods in forming the final results by the random forest integrator. The impact of every method was measured as the percentage of times the particular classification result was chosen by the decision trees of the random forest as the discriminating variable in the classification process.

Fig. 3. Relative importance of selection methods at the final integration stage of the ensemble made by RF.

The Fisher and ReliefF methods were found to play the most significant role in the final classification decision made by the RF integrator. On the other hand, the least important in the integration were the results of the two-sample t-test and stepwise regression methods.

To check the importance of the genes selected by our methods, we compared their classification results to the results obtained with the same number of genes chosen from the database at random. The same number of runs (10) was used in this case. Table 1 presents the averaged class recognition accuracy of the individual classification systems in the 10-fold cross-validation procedure. The first row corresponds to the proposed selection methods and the second row to the same number of genes chosen at random in the succeeding runs. The columns present the results of application of the different selection methods. The last column depicts the final accuracy achieved after fusion by the random forest. The numbers in the table present the mean values and standard deviations obtained in 10 independent runs of the selection and classification procedures.

The final class recognition accuracy obtained by using the random forest as an integrator is significantly better. This accuracy increased from the best individual 79.84 ± 5.66% related to FDA to 86.07 ± 2.79% after integration. Not only was the mean accuracy of recognition increased, but the standard deviation was also reduced by almost half. Also interesting are the results for the random choice of genes: even in this case the ensemble was able to significantly increase the overall accuracy. This stresses the advantages of using an ensemble of classifiers.

The details of class recognition are also depicted in the form of a confusion matrix presenting the average percentage results of all 10 runs of the system (all results correspond to the testing data). The rows present the percentage of the real class membership of the data and the columns the results of classification. The diagonal entries (i = j) depict the percentage of properly recognized classes. Each entry outside the diagonal represents the percentage of misclassified cases. The entry in the (i, j)th position of the matrix means false assignment of the ith class to the jth one. Table 2 presents the averaged results of class recognition.


In order to assess the importance of our gene ranking methods, we repeated the classification procedure 10 times with randomly chosen combinations of 30 genes. We used 30 genes because this was the average number of genes indicated by the genetic algorithm across the different runs and selection methods of the experiments. The obtained results are presented in the form of a confusion matrix in Table 3. This time the averaged misclassification rate was very high (33.90%). The ratios of true positive to false negative and of true negative to false positive cases were roughly 2:1. This is almost four times worse than the results of our approach to gene selection. We also tried larger and smaller numbers of randomly selected genes, but the average results were very similar, differing by no more than 1%. This experiment confirmed that application of the best selected genes in the representation of samples provides the highest accuracy (the smallest relative error) of autism recognition.

Another issue important in medical practice concerns specially defined quality measures. They distinguish the true positive (TP) cases (proper recognition of autism) from the true negative (TN) ones, representing proper recognition of reference instances. By the symbol FN we denote the number of autism cases falsely recognized as healthy, and by FP the healthy cases recognized as autism. On the basis of these notations, four quality measures were defined [25]. The true positive rate (TPR), also called sensitivity, is the fraction of all positive examples predicted correctly by the classifier: TPR = TP/(TP + FN). The true negative rate (TNR), called specificity, is the fraction of negative examples correctly predicted by the classifier: TNR = TN/(TN + FP). The false alarm rate (FA) is the fraction of negative cases recognized by the classifier as positive: FA = FP/(FP + TN). The false negative rate is FNR = FN/(TP + FN). Table 4 presents the values of these quality measures for the recognition of autism in the two types of experiments. The first row presents the results of our system, and the second row the results for the 30 randomly chosen genes. The sensitivity and specificity of the proposed selection and classification system were much higher than for the randomly selected genes.

We performed additional experiments on a database used by other researchers, to compare our method with their approach. We used the same GEO gene base [26] as used in [3]. This base contains 39,936 genes for 116 cases (87 combined autistic samples and 29 non-autistic controls). We followed the same strategy in the introductory elimination of genes, removing those which were absent in some observations. This step reduced the number of genes to only 3913. In our approach we applied the 8 methods of gene selection simultaneously and then followed our procedure of autism recognition using the genetic algorithm and the random forest in the role of ensemble fusion. The 10-fold cross-validation was used to determine the accuracy, sensitivity and specificity of correct assignment to the autistic case or control group. The statistical results depicting the accuracy of the different methods of gene selection, and after their fusion, are shown in Table 5.

The sensitivity of our system (after fusion) was 96.25% and the specificity 82.97%. By comparison, the corresponding results reported in [3], using the single USC method of gene selection and an SVM classifier, were as follows: accuracy of correct assignment to the autistic case or control group of 81.8%, a sensitivity of about 91% and a specificity of 61%.
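The four quality measures used throughout this section follow directly from the raw confusion counts. The short helper below (the names are ours, for illustration only) reproduces them, with the example call using the Table 2 rates scaled approximately to the class sizes of 82 and 64.

```python
# Quality measures of Section 4.2 computed from raw confusion counts.
def quality_measures(tp: int, tn: int, fp: int, fn: int) -> dict:
    return {
        "TPR": tp / (tp + fn),  # sensitivity: autism cases recognized correctly
        "TNR": tn / (tn + fp),  # specificity: controls recognized correctly
        "FA":  fp / (fp + tn),  # false alarm rate: controls flagged as autism
        "FNR": fn / (tp + fn),  # miss rate: autism cases recognized as healthy
    }

# Example: Table 2 rates scaled (approximately) to class sizes 82 and 64.
print(quality_measures(tp=72, fn=10, tn=54, fp=10))
# {'TPR': 0.878..., 'TNR': 0.843..., 'FA': 0.156..., 'FNR': 0.121...}
```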

Table 1
The averaged class recognition accuracy and the standard deviations of the SVM classifier supplied with sets of genes selected by the different methods.

Genes                          FDA [%]       RFA [%]        TT [%]        KST [%]       KWT [%]       COR [%]       SW [%]        SVM [%]       Fusion [%]
Optimal number of best genes   79.84 ± 5.66  70.15 ± 5.49   72.22 ± 5.57  71.64 ± 5.19  71.44 ± 4.81  73.09 ± 3.06  71.40 ± 5.05  68.49 ± 6.41  86.07 ± 2.79
Random choice of genes         55.12 ± 4.65  51.36 ± 10.00  54.72 ± 7.15  54.19 ± 4.77  51.01 ± 3.45  55.28 ± 5.61  53.67 ± 3.59  55.22 ± 5.16  67.16 ± 2.09

Table 2
Confusion matrix of class recognition results at application of the best genes, after integration.

                 Predicted class 1   Predicted class 2
True class 1     0.88                0.12
True class 2     0.16                0.84

Table 3
Confusion matrix of classification results applying 30 randomly selected genes, after integration.

                 Predicted class 1   Predicted class 2
True class 1     0.69                0.31
True class 2     0.39                0.61

Table 4
The values of quality measures in distinguishing autism cases from healthy ones, for the best genes selected by our method and for the 30 randomly chosen genes.

                        TPR    TNR    FNR    FA
Best selected genes     0.85   0.88   0.15   0.13
30 random genes         0.70   0.62   0.30   0.38

Table 5
The averaged class recognition accuracy of the SVM classifier supplied with sets of genes selected by the different methods, and the result of their fusion (second autism database [26]).

FDA [%]  RFA [%]  TT [%]  KST [%]  KWT [%]  COR [%]  SW [%]  SVM [%]  Fusion [%]
89.54    85.51    91.65   90.29    89.99    90.34    91.51   87.89    92.93

5. Conclusions

The paper presented an ensemble of classifiers for recognition of autism cases versus the reference class on the basis of gene expression microarray data. The important point of this method is the selection of the most relevant genes, strictly associated with autism. Eight different feature validation methods were applied in gene selection. The 100 best genes formed the input to a genetic algorithm, which was responsible for establishing the final (optimal) set of genes. The selected genes were used as the input attributes for the classifiers responsible for distinguishing autism cases from the reference ones.


To get the most reliable results we applied an ensemble of classifiers composed of SVMs with a Gaussian kernel, with a random forest playing the role of the integrating unit of the ensemble. The SVM classifiers were supplied with different sets of features chosen by the various selection methods. The obtained results confirmed the good performance of such a system. The average classification accuracy obtained on the first GEO database [20] in the 10-fold cross-validation mode was above 86%. For the second autism database [26] the results were even better: the accuracy of correct assignment to the autistic case or control group was this time 92.93%. These results outperform the best accuracy of 81.8% reported in the most recent paper [3] for the autism database [26]. In our opinion, the main sources of this high efficiency are the application of a genetic algorithm to the final selection of genes and the use of a random forest for integration of the ensemble of classifiers.

The results presented in the paper form the first step in exploring microarray data related to autism. They allow identification of the genes best associated with the disease. The developed system, in the form of an ensemble of classifiers, can be used in early prediction of this disease on the basis of microarray data.

The proposed approach to gene expression analysis, based on the application of many methods of feature selection, has proved its benefits in one of the most demanding problems, autism recognition. According to our experience, the higher the number of independent selection methods, the better the expected accuracy of an ensemble of classifiers. In the future we will try to increase the number of selection methods, hoping to increase the final accuracy of autism recognition. At the same time we will check the efficiency of our approach in problems related to different types of diseases, for example cancers. We have conducted introductory experiments on a breast cancer database [27,28] containing 78 patient samples, 34 of which were from patients who had developed distant metastases within 5 years, and the remaining 44 from patients who remained disease-free for at least 5 years after their initial diagnosis. The number of genes was 24,481. The accuracy offered by the individual methods ranged from 76.34% (Kruskal–Wallis method) to 81.86% (Fisher method). After application of our procedure, the recognition rate increased to 86.97%. The best average classification accuracy reported in [15] for the same database was 74%. Another paper [29], dealing with a different database containing 67 breast cancer samples and 60 healthy samples (29,098 genes), reported an average accuracy of 79.53%. Against this background our results are encouraging and suggest that we should continue our study in this direction.

Conflict of interest statement None declared.

Appendix. Applied gene selection methods

A.1. Fisher discriminant

In Fisher discriminant analysis the greatest weight is assigned to a feature characterized by a large difference of the mean values in the two studied classes and small values of the standard deviations within each class. The two-class discrimination measure of the feature f is defined in the form [25,30]

S_{12}(f) = \frac{|c_1 - c_2|}{\sigma_1 + \sigma_2}    (A1)

where c_1 and c_2 represent the mean values for classes 1 and 2, respectively, while \sigma_1 and \sigma_2 are the appropriate standard deviations. A large value of S_{12}(f) indicates good class discriminative ability of the feature.

A.2. ReliefF algorithm

The ReliefF algorithm ranks the features according to the highest correlation with the observed class while taking into account the distances between opposite classes [31]. It estimates the quality of a feature according to how well its values distinguish between observations that are near to each other. It randomly selects an instance Ri and then searches for k of its nearest neighbors from the same class, called nearest hits Hj, and also k nearest neighbors from each of the different classes, called nearest misses Mj(C). The quality estimation W(A) for all attributes A is updated depending on their values for Ri, the hits Hj and the misses Mj(C). If instances Ri and Hj have different values of the attribute A, then this attribute separates two instances of the same class, which is not desirable; in such a case the quality estimation W(A) is decreased. If instances Ri and Mj have different values of the attribute A, then this attribute separates two instances of different classes, and the quality estimation W(A) is increased. The algorithm averages the contribution of all hits and misses. The process is repeated m times, where m is a user-defined parameter. A detailed description of the procedure can be found in [31].

A.3. Two-sample t-test

The next selection method used is the two-sample Student t-test. It tests the null hypothesis that the data in classes 1 and 2 are independent random samples from normal distributions with equal means and equal, but unknown, variances, against the alternative hypothesis that the means are not equal. The test statistic is formulated in the form

t = \frac{c_1 - c_2}{\sqrt{\sigma_1^2/n + \sigma_2^2/m}}    (A2)

where n and m represent the sample sizes of both classes [32]. The two-sample t-test implemented in MATLAB as the ttest2 function [33] returns the p-value of the test. A low value of p indicates that the compared populations are significantly different. It means a rejection of the null hypothesis at the usually assumed 5% significance level.

A.4. Kolmogorov–Smirnov test

Another statistical feature selection method applied in the research was the Kolmogorov–Smirnov (KS) test. It compares the empirical cumulative distributions of the groups of data to determine whether the samples come from the same population [32]. The null hypothesis is that both classes are drawn from the same continuous distribution; the alternative hypothesis is that they are drawn from different distributions. The KS test statistic is based on the relation

KS = \max(|F_1(x) - F_2(x)|)    (A3)

where F_1(x) and F_2(x) are the cumulative distributions of the samples of feature f belonging to classes 1 and 2. A high value of this coefficient indicates that the feature has good class discrimination ability. On the other hand, a small value of this factor indicates that the feature should be rejected at the selection stage.

A.5. Kruskal–Wallis test

In this method the medians of the samples are also compared, but in contrast to the KS test it uses ranks of the data rather than their numeric values [32]. It finds ranks by ordering the data samples from the smallest to the largest across all groups and taking the numeric index of this ordering. It returns the p value for the null hypothesis


that all samples are drawn from the same population. The Kruskal–Wallis test is implemented in MATLAB as the kruskalwallis function [33].

A.6. Stepwise regression method

Stepwise regression is a systematic method for adding features to, and removing them from, the set of input attributes based on their statistical significance in a regression. It begins with an initial linear model and then compares the explanatory power of incrementally larger and smaller models. At each step, the p value of the F-statistic [32] is computed to test models with and without a selected feature. Based on this statistic, the algorithm decides whether the feature should be included in the model or not. If a feature is not currently in the model, the null hypothesis is that the term would have a zero coefficient if added to the model; if there is sufficient evidence to reject the null hypothesis, the feature is added to the model. Conversely, if a feature is currently in the model, the null hypothesis is that the term has a zero coefficient; if there is insufficient evidence to reject the null hypothesis, the term is removed from the model. The algorithm stops when no step leads to an increase of the model accuracy.

A.7. Feature correlation with a class

In this method the direct correlation of the feature values with a class is examined. The discriminative value S(f) of the feature f for recognizing one class from the other K classes is defined as [25,30]

S(f) = \frac{\sum_{k=1}^{K} P_k (c_k - c)^2}{\sigma^2(f) \sum_{k=1}^{K} P_k (1 - P_k)}    (A4)

where c is the mean value of the feature for all data, c_k is the mean value of the feature for the kth class, \sigma^2(f) is the variance of the feature, and P_k is the probability of occurrence of the kth class in the dataset (usually the uniform distribution is assumed).

A.8. SVM recursive feature elimination

In SVM recursive feature elimination (SVM-RFE) the SVM network with a linear kernel is used [24,34]. The network is learned with all available features applied simultaneously as input attributes. In the case of a classifier, the sign function is added to match the input values to the appropriate class label. The output signal y at presentation of the features organized in the form of the vector f is defined by the following equation

y(\mathbf{f}) = \mathrm{sgn}(u) = \mathrm{sgn}(\mathbf{w}^T \mathbf{f} + b)    (A5)

where w = [w_1, w_2, ..., w_n]^T is the weight vector, f = [f_1, f_2, ..., f_n]^T is the vector of features, and b is a bias. A large absolute value of the weight connecting the feature f with the network denotes a strong ability of this feature to distinguish the two classes. In the SVM-RFE approach the features are eliminated step by step, and the SVM is retrained at each step on the gradually smaller population of features. In the first step the linear SVM network is learned using all features. Then the weights are sorted in descending order of absolute value, and the features associated with the smallest absolute values of the weights are eliminated. Twenty percent of the actual number of genes is removed in each step, and the process is repeated until the required number of features is obtained.
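Two of the criteria above are easy to illustrate in code: the Fisher measure (A1) as a direct per-gene ranking, and SVM-RFE via a recursive-elimination wrapper around a linear SVM, dropping 20% of the remaining features per step as described in A.8. The sketch below uses scikit-learn's RFE as a stand-in; this choice of library is our assumption, since the paper's experiments were run in MATLAB.

```python
# Sketches of the Fisher measure (A1) and SVM-RFE (A.8).
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

def fisher_scores(X, y):
    """S12(f) = |c1 - c2| / (sigma1 + sigma2) for every feature (gene)."""
    X1, X2 = X[y == 1], X[y == 0]
    return np.abs(X1.mean(0) - X2.mean(0)) / (X1.std(0) + X2.std(0) + 1e-12)

def svm_rfe(X, y, n_keep=100):
    """Recursive feature elimination with a linear SVM, 20% dropped per step."""
    rfe = RFE(estimator=LinearSVC(), n_features_to_select=n_keep, step=0.2)
    rfe.fit(X, y)
    return np.where(rfe.support_)[0]  # indices of the retained genes

# Ranking by the Fisher measure, most discriminative gene first:
# order = np.argsort(fisher_scores(X, y))[::-1]
```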

References

[1] A. Bailey, W. Philips, Autism: toward an integration of clinical, genetic, neuropsychological and neurobiological perspectives, J. Child Psychol. Psychiatry 37 (1996) 89–126.
[2] M.W. State, P. Levitt, The conundrums of understanding genetic risks for autism spectrum disorders, Nat. Neurosci. 14 (2011) 1499–1506.
[3] V. Hu, L. Yinglei, Developing a predictive gene classifier for autism spectrum disorders based upon differential gene expression profiles of phenotypic subgroups, N. Am. J. Med. Sci. 6 (2013) 107–116.
[4] M.S. Yang, M. Gill, A review of gene linkage, association and expression studies in autism and in assessment of convergent evidence, Int. J. Dev. Neurosci. 25 (2007) 69–85.
[5] S. Russell, L.A. Meadows, R.R. Russel, Microarray Technology in Practice, Academic Press, 2008.
[6] M. Alter, R. Kharkar, K. Ramsey, D. Craig, R. Melmed, T. Grebe, R. Curtis-Bay, S. Ober-Reynolds, J. Kirwan, J. Jones, J. Blake-Turner, R. Hen, D. Stephan, Autism and increased paternal age related changes in global levels of gene expression regulation, PLoS One 6 (2011) 1–10.
[7] T. Golub, et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (1999) 531–537.
[8] M. Eisen, P. Spellman, P. Brown, Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. U.S.A. 95 (1998) 14863–14868.
[9] I. Guyon, A.J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using SVM, Mach. Learn. 46 (2002) 389–422.
[10] A. Wiliński, S. Osowski, Gene selection for cancer classification, COMPEL 28 (2009) 231–241.
[11] P. Baldi, A.D. Long, A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inference of gene changes, Bioinformatics 17 (2001) 509–519.
[12] X. Huang, W. Pan, Linear regression and two-class classification with gene expression data, Bioinformatics 19 (2003) 2072–2078.
[13] S. Zheng, W. Liu, An experimental comparison of gene selection by Lasso and Dantzig selector for cancer classification, Comput. Biol. Med. 41 (2011) 1033–1040.
[14] P.J. Woolf, Y. Wang, A fuzzy logic approach to analyzing gene expression data, Physiol. Genomics 3 (2000) 9–15.
[15] X. Wang, O. Gotoh, A robust gene selection method for microarray-based cancer classification, Cancer Inf. 9 (2010) 15–30.
[16] H. Mitsubayashi, S. Aso, T. Nagashima, Y. Okada, Accurate and robust gene selection for disease classification using simple statistics, Biomed. Inf. 391 (2008) 68–71.
[17] F. Yang, Robust feature selection for microarray data based on multicriterion fusion, IEEE Trans. Comput. Biol. Bioinf. 8 (2011) 1080–1092.
[18] F. Esteban, D. Wall, Using game theory to detect genes involved in autism spectrum disorder, TOP 19 (2011) 121–129.
[19] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[20] NCBI database, http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS4431, 2011.
[21] C. Lord, M. Rutter, A. Le Couteur, Autism Diagnostic Interview-Revised: a revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders, J. Autism Dev. Disord. 24 (1994) 659–685.
[22] D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, 1989.
[23] R. Siroic, S. Osowski, T. Markiewicz, K. Siwek, Application of support vector machine and genetic algorithm for improved blood cell recognition, IEEE Trans. Instrum. Meas. 58 (2009) 2159–2168.
[24] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
[25] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification and Scene Analysis, Wiley, New York, 2003.
[26] http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15402.
[27] http://datam.i2r.a-star.edu.sg/datasets/krbd/BreastCancer/BreastCancer.html.
[28] L.J. van 't Veer, et al., Gene expression profiling predicts clinical outcome of breast cancer, Nature 415 (2002) 530–536.
[29] J. Aarøe, T. Lindahl, V. Dumeaux, S. Sæbø, D. Tobin, N. Hagen, P. Skaane, A. Lönneborg, P. Sharma, A.L. Børresen-Dale, Gene expression profiling of peripheral blood cells for early detection of breast cancer, Breast Cancer Res. 12 (R7) (2010) 1–11, http://dx.doi.org/10.1186/bcr2472.
[30] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1158–1182.
[31] M. Robnik-Šikonja, I. Kononenko, Theoretical and empirical analysis of ReliefF and RReliefF, Mach. Learn. 53 (2003) 23–69.
[32] P. Sprent, N.C. Smeeton, Applied Nonparametric Statistical Methods, Chapman & Hall/CRC, Boca Raton, FL, 2007.
[33] MATLAB User Manual, Statistics Toolbox, MathWorks, Natick, MA, 2013.
[34] I. Guyon, A.J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using SVM, Mach. Learn. 46 (2002) 389–422.
