Med Biol Eng Comput (2014) 52:1041–1052 DOI 10.1007/s11517-014-1200-8

ORIGINAL ARTICLE

Feature selection and classification of leukocytes using random forest

Mukesh Saraswat · K. V. Arya

Received: 5 March 2013 / Accepted: 22 September 2014 / Published online: 5 October 2014 © International Federation for Medical and Biological Engineering 2014

Abstract  In automatic segmentation of leukocytes from the complex morphological background of tissue section images, a vast number of artifacts/noise are also extracted, generating a large amount of multivariate data. This multivariate data degrades the ability of a classifier to discriminate between leukocytes and artifacts/noise. Selecting prominent features therefore plays an important role in reducing the computational complexity and increasing the performance of the classifier as compared to a high-dimensional feature space. This paper introduces a novel Gini importance-based binary random forest feature selection method. Moreover, the random forest classifier is used to classify the extracted objects into artifacts, mononuclear cells, and polymorphonuclear cells. The experimental results establish that the proposed method effectively eliminates the irrelevant features while maintaining high classification accuracy as compared to other feature reduction methods.

Keywords  Leukocytes classification · Random forest · Gini importance · Dimensionality reduction · Feature selection

M. Saraswat (*) · K. V. Arya
ABV-Indian Institute of Information Technology and Management, Gwalior 474010, India
e-mail: [email protected]
K. V. Arya e-mail: [email protected]

1 Introduction

Leukocytes, also known as inflammatory cells or white blood cells (WBCs), are cells of the immune system which defend the body against both infectious disease and foreign materials [24]. These cells are present in blood and infiltrate into tissues at the time of injury. A leukocyte consists of a nucleus (bluish when stained with hematoxylin) within cytoplasm (pinkish with eosin in the H&E staining method). Based on the structure of their nuclei, leukocytes are divided into two main categories: polymorphonuclear cells, which have granules in their cytoplasm, and mononuclear cells, which have a single nucleus. Automatic identification of leukocytes is of paramount importance for disease identification, monitoring of disease progress, and drug development. However, very little work has been done on the identification of leukocytes in tissue images due to their wide natural biological variability. Recent research indicates that leukocytes are still quantified manually, for example in chicken skin [29] and mice skin [28, 30] models. Manual counting of leukocytes is a time-consuming process and requires a trained, dedicated pathologist. Several variables, such as pathologist bias and variations in staining, pose challenges for research scientists counting leukocytes manually [39, 41]. Therefore, there is growing interest in developing tools that reduce human subjectivity, bias, and the time taken for quantification of leukocytes. Some useful methods have been developed for identification of leukocytes in blood smear images [22, 46], but few efforts have been made for tissue section images. Therefore, the main goal of this work is to design and develop a pattern recognition method for the identification of leukocytes in tissue section images. A leukocyte identification method proceeds through leukocyte segmentation, feature extraction, and classification.
However, automatic segmentation of leukocytes from the complex morphological background of tissue section images also extracts a vast number of artifacts/noise, thereby producing an enormous amount of multivariate data. These multivariate data degrade the ability of a classifier to discriminate between leukocytes and artifacts. Little work has been reported on reducing artifacts among the objects extracted from tissue section images. These artifacts are created by improper handling prior to fixation, processing of the tissue, its sectioning, and staining. Most artifacts are identified by their morphological characteristics, mainly the nuclear characteristics. In general, these structures do not contain a nucleus, but a dark halo zone is misrepresented as one. Some artifacts appear as tiny points within the tissue. To reduce these tiny points prior to segmentation, different methods have been used, the most common being median filtering and mathematical morphology [14]. Median filtering uses pixel intensity to reduce noise, while mathematical morphology uses the shape characteristics of the objects. Mohapatra et al. [33] applied median filtering followed by unsharp masking to all images before cell segmentation. Phukpattaranont and Boonyaphiphat [35] used morphological erosion and opening operations to eliminate spike noise and to simplify the shape of cells. However, these methods may change the shape and size of actual leukocytes. Moreover, artifacts whose shape and size are similar to those of leukocytes cannot be removed by median filtering or morphological operations.

Recently, we introduced an unsupervised method [38] based on differential evolution (DE) [45] to segment leukocytes from images of mice skin sections stained with hematoxylin and eosin (H&E) and acquired at 40× magnification. The method performs unsupervised multilevel clustering in two phases. The first phase uses pixel intensity to extract leukocyte-type objects from the complex background of skin tissue images.
A vast number of artifacts are also segmented along with the leukocytes in this phase because their intensity is similar to that of leukocytes. These artifacts can only be differentiated from leukocytes on the basis of their morphological structure. Therefore, to reduce artifacts, a second phase is introduced which performs unsupervised multilevel clustering using a feature vector of four features for each extracted object. However, this process can be improved by making it supervised. A supervised artifact reduction method may provide better results because it is based on learning and classification of objects. Therefore, this paper modifies the second phase of the unsupervised method [38] by using a supervised binary random forest classifier to classify the objects extracted in the first phase into noise, mononuclear cells, and polymorphonuclear cells.

The performance of supervised classification methods depends strongly on the selection of prominent features from a high-dimensional feature space. A high-dimensional feature space may include some irrelevant or redundant features which may adversely affect the performance and complexity of the classifier. Therefore, the selection of optimal features is of vital importance to reduce the computational complexity and increase the performance of the classifier as compared to a high-dimensional feature space [3, 8]. A feature selection method finds new feature subsets using an evaluation measure which scores the different feature subsets. An optimal or suboptimal feature subset can be found using various search methods. If there are $N$ features, then the total number of feature subsets is $2^N$. An exhaustive search, with complexity $O(2^N)$, finds the optimal solution by searching all $2^N$ feature subsets, which becomes impractical for high-dimensional data sets [23]. Therefore, many feature selection methods have been developed to solve this problem [17]; they are generally categorized into filter, wrapper, and embedded methods [7]. Filter methods consider the contribution of a set of features to the class variable [18]. These methods are computationally efficient but may not perform optimally for a given classifier. Wrapper methods use predictive models to evaluate feature subsets and are more computationally demanding than filter methods [7]. Sequential backward selection (SBS) [5] is one of the most commonly used wrapper methods and is based on a greedy hill-climbing search. SBS iteratively eliminates the least promising features one by one until the performance of the learning model drops below a given threshold. In embedded methods, the information obtained from a supervised learner is used for the selection of features. Support vector machine with recursive feature elimination (SVM-RFE) [16] is a well-known example of an embedded method. SVM-RFE eliminates the features which have the lowest weights obtained from a trained SVM. However, both wrapper and embedded methods are computationally intensive.
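The greedy elimination performed by SBS can be illustrated with a minimal Python sketch. It is not the implementation of [5]: the function name `sequential_backward_selection`, the scoring callable `toy_score`, and the feature names are hypothetical, and the toy score merely stands in for the accuracy of a learning model evaluated on a feature subset.

```python
def sequential_backward_selection(features, score, min_score):
    """Greedy backward elimination (sketch).

    `score` is an assumed callable that maps a list of features to a
    performance value (higher is better); `min_score` is the threshold
    below which elimination stops.
    """
    selected = list(features)
    while len(selected) > 1:
        # Build every subset obtained by dropping exactly one feature.
        candidates = [[f for f in selected if f != drop] for drop in selected]
        best = max(candidates, key=score)
        if score(best) < min_score:
            break  # any further removal falls below the threshold
        selected = best
    return selected


# Toy stand-in for model performance: two informative features plus a
# small penalty per feature kept (purely illustrative numbers).
USEFUL = {"area": 0.5, "perimeter": 0.3}


def toy_score(subset):
    return sum(USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)


all_features = ["area", "perimeter", "noise1", "noise2"]
n_subsets = 2 ** len(all_features)  # size of the exhaustive search space
kept = sequential_backward_selection(all_features, toy_score, min_score=0.75)
print(n_subsets, kept)
```

Each pass evaluates only as many subsets as there are remaining features, so the greedy search avoids the $2^N$ evaluations of an exhaustive search at the cost of possibly missing the optimal subset.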
Recently, random forest (RF) [4] has been used for feature subset selection [9, 36, 37]. RF is based on growing an ensemble of trees/classifiers using bootstrapped samples of the training data set, which improves the classification accuracy significantly. RF deals effectively with numerical and categorical features, interactions, and nonlinearities. It requires little data preprocessing, such as normalization, because it is invariant to feature scales and insensitive to outliers [7]. Diaz-Uriarte and Alvarez de Andres [9] proposed an iterative method, varSelRF, to eliminate the irrelevant features. At each iteration, the fraction of features having the smallest variable importance is eliminated. From the resulting feature subsets, the one with the smallest out-of-bag (OOB) error rate is selected. Deng and Runger [7] proposed a regularized random forest (RRF) method which evaluates the features on a part of the training data at each tree node. However, these methods are computationally intensive for a high-dimensional feature space.
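The iterative elimination described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the varSelRF package itself: `iterative_elimination`, `toy_importance`, and `toy_oob_error` are hypothetical names, and the two toy callables replace a trained random forest that would normally supply variable importances and OOB error rates.

```python
import math


def iterative_elimination(features, importance, oob_error,
                          drop_fraction=0.2, min_features=2):
    """Drop the least important fraction of features per iteration and
    return the visited subset with the smallest OOB error (sketch).

    `importance(subset)` is assumed to return {feature: importance} and
    `oob_error(subset)` an error rate, both from a forest refit on `subset`.
    """
    current = list(features)
    history = [(oob_error(current), current)]
    while len(current) > min_features:
        scores = importance(current)
        n_drop = max(1, math.floor(drop_fraction * len(current)))
        ranked = sorted(current, key=lambda f: scores[f])  # least important first
        dropped = set(ranked[:n_drop])
        current = [f for f in current if f not in dropped]
        history.append((oob_error(current), current))
    return min(history, key=lambda pair: pair[0])[1]


# Toy stand-ins for a trained forest (illustrative values only).
TRUE_IMPORTANCE = {"f1": 0.40, "f2": 0.30, "f3": 0.20,
                   "f4": 0.05, "f5": 0.03, "f6": 0.02}
INFORMATIVE = {"f1", "f2", "f3"}


def toy_importance(subset):
    return {f: TRUE_IMPORTANCE[f] for f in subset}


def toy_oob_error(subset):
    noise = sum(1 for f in subset if f not in INFORMATIVE)
    missing = len(INFORMATIVE - set(subset))
    return 0.05 * noise + 0.1 * missing


best_subset = iterative_elimination(list(TRUE_IMPORTANCE),
                                    toy_importance, toy_oob_error)
print(best_subset)
```

In a real setting every call to `importance` and `oob_error` would involve refitting a forest, which is one way to see why such iterative schemes become expensive as the number of features grows.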

Therefore, in this work, a novel and computationally efficient binary random forest feature selection (BRFFS) method based on Gini importance is introduced. Further, the selected feature subset is fed to an RF classifier to classify the objects into artifacts, mononuclear cells, and polymorphonuclear cells. The performance of the proposed method is compared with existing popular feature reduction methods, viz. SVM-RFE, RRF, varSelRF, and SBS. Moreover, support vector machine (SVM), linear discriminant analysis (LDA), artificial neural network (ANN), k-nearest neighbor (kNN), ZeroR, and random forest (RF) classifiers have been used to analyze the performance of the selected features on different data sets.

The rest of the paper is organized as follows. A brief overview of RF and Gini importance is given in Sect. 2. The proposed methodology of feature selection and classification is described in Sect. 3. Algorithm validation is given in Sect. 4. In Sect. 5, experiments are performed and the results are analyzed. Finally, the conclusion is presented in Sect. 6.

2 Random forest and Gini importance

Breiman [4] developed an efficient classification method (RF) which is based on growing an ensemble of trees/classifiers using bootstrapped samples of the training data set. Random forest constructs binary classification trees from a training set having M input variables. At each node, m (m ≪ M) of these M input variables are selected at random, with m held fixed for the entire forest, and the best split on these m variables is chosen. Each tree in the forest is grown in this way without a pruning step. The steps of RF are summarized in Algorithm 1 [4].

Algorithm 1 Random Forest
Comment: {Let M represent the total number of input variables that are to be used for growing an ensemble of trees using bootstrapped samples of the training data set.}
1. Randomly select m variables from the M input variables to determine the decision at a node of the tree.
2. Select a bootstrap sample from the training set for this tree by choosing n times with replacement from all N available training cases. The remaining cases are used to estimate the error of the tree by predicting their classes.
3. Calculate the best split based on these m variables in the training set.
4. Each tree is fully grown and not pruned.
5. For prediction, a new sample is pushed down the tree. It is assigned the label of the training sample in the terminal node it ends up in.
6. Repeat steps 1-5 over all the considered trees in the ensemble.
7. Take the average vote of all trees as the random forest prediction.
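The sampling and voting steps of Algorithm 1 (steps 1, 2, and 7) can be sketched in Python as follows. The helper names `bootstrap_indices`, `candidate_variables`, and `forest_vote` are hypothetical, the number of draws is assumed to equal the number of training cases N, and the "average vote" of step 7 is read here as a majority vote over class labels; tree growing itself (steps 3 to 5) is omitted.

```python
import random
from collections import Counter


def bootstrap_indices(n_cases, rng):
    """Step 2: draw n_cases indices with replacement; unused cases are
    the out-of-bag (OOB) cases used to estimate the tree's error."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    out_of_bag = sorted(set(range(n_cases)) - set(in_bag))
    return in_bag, out_of_bag


def candidate_variables(n_variables, m, rng):
    """Step 1: pick m distinct variables out of M as split candidates."""
    return rng.sample(range(n_variables), m)


def forest_vote(tree_predictions):
    """Step 7 (assumed majority vote): most frequent label among trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
in_bag, out_of_bag = bootstrap_indices(10, rng)
split_candidates = candidate_variables(n_variables=8, m=3, rng=rng)
label = forest_vote(["artifact", "mononuclear", "artifact"])
print(in_bag, out_of_bag, split_candidates, label)
```

Because the in-bag indices are drawn with replacement, some training cases are typically left out of each tree's sample, and these are the cases available for the OOB error estimate mentioned in step 2.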

RF is one of the most accurate and powerful learning methods and has produced highly accurate results for many types of data, such as microarray [12, 19, 44], time series [42], and spectra [31]. Although RF is not widely used for microarray data sets, its attributes make it well suited to them: RF can produce good results for data sets which have more variables than observations and also performs well in the presence of noise [10]. An extensive study was carried out by Lee et al. [26] to compare LDA, kNN, bagging trees, boosting, and RF on seven microarray data sets. Of these methods, RF was found to be the most efficient on the considered data sets. Diaz-Uriarte and Alvarez de Andres [10] showed that RF is able to select smaller gene sets from microarray data while preserving predictive accuracy as compared to LDA, kNN, and SVM classifiers. They observed that their RF-based gene selection method, on 10 different microarray data sets, returns very small sets of genes while retaining predictive performance as compared to seven alternative state-of-the-art methods. Wu et al. [48] performed a thorough comparison of ensemble methods (bagging, boosting, random forest) with individual classifiers (LDA, quadratic discriminant analysis, kNN, SVM) on matrix-assisted laser desorption/ionization with time-of-flight (MALDI-TOF) data. They observed that RF, on average, gives the lowest error rate with the smallest variance. Klassen [21] also compared RF and SVM on four microarray cancer data sets and found RF better than SVM. Recently, RF-based methods have also been used successfully for real-time classification of the fractional masses in mass spectrometry experiments [20]. Moreover, Liu et al. [27] investigated t-test, significance analysis of microarray (SAM), rank products (RP), and RF-based feature selection methods on acute lymphoblastic leukemia, acute myeloid leukemia, breast cancer, and lung cancer Affy data sets. They observed that the SAM- and RF-based feature selection methods have the best classification performance as compared to the other methods.
Another important property of RF is that it provides a measure of feature importance, also known as the Gini importance ($I_G$) of features, which is calculated as follows [32]:

$$I_G(\theta) = \sum_{T} \sum_{\tau} \Delta i_\theta(\tau, T) \qquad (1)$$

where $\theta$ is the feature for which the Gini importance is calculated and $\Delta i_\theta(\tau, T)$ is the decrease in Gini impurity at node $\tau$ within binary tree $T$. This value gives information about the number of times the particular feature is selected for a split. Hence, it provides a ranking of the features, which is used to eliminate the less useful features during feature selection.
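As an illustration of Eq. (1), the following Python sketch sums impurity decreases over the nodes of a small hand-made forest. It assumes the standard definitions of the Gini impurity, $i(\tau) = 1 - \sum_k p_k^2$, and of the decrease at a split, $\Delta i = i(\tau) - p_l\, i(\tau_l) - p_r\, i(\tau_r)$; these definitions, the function names, and the toy forest representation (each node given as a tuple of feature name, parent labels, left labels, and right labels) are assumptions of the sketch rather than details taken from this paper. Library implementations may additionally weight or normalize the sum.

```python
from collections import Counter, defaultdict


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Decrease in Gini impurity produced by splitting parent into left/right."""
    n = len(parent)
    return (gini_impurity(parent)
            - len(left) / n * gini_impurity(left)
            - len(right) / n * gini_impurity(right))


def gini_importance(forest):
    """Eq. (1): sum the decreases over all nodes of all trees, per feature."""
    totals = defaultdict(float)
    for tree in forest:
        for feature, parent, left, right in tree:
            totals[feature] += impurity_decrease(parent, left, right)
    return dict(totals)


# Toy forest: "area" separates the classes perfectly, "texture" not at all.
parent = ["cell", "cell", "artifact", "artifact"]
perfect = ("area", parent, ["cell", "cell"], ["artifact", "artifact"])
useless = ("texture", parent, ["cell", "artifact"], ["cell", "artifact"])
toy_forest = [[perfect], [perfect, useless]]

importances = gini_importance(toy_forest)
ranking = sorted(importances, key=importances.get, reverse=True)
print(importances, ranking)
```

Sorting the features by the accumulated value yields the ranking referred to above; features at the bottom of the ranking are the candidates for elimination.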

3 Proposed methodology

The proposed leukocytes classification method consists of four phases: (1) the first phase, which is similar to Saraswat et al. [38], extracts the different objects (leukocytes and artifacts) using intensity information of the pixels; (2) the second phase represents the extracted objects as feature vectors of geometric and texture features; (3) the third phase reduces the feature vector by selecting the prominent features using the proposed novel binary random forest feature selection (BRFFS) method; and (4) finally, the selected feature vectors are given to a random forest classifier to classify the objects into artifacts, mononuclear cells, and polymorphonuclear cells. Each phase is described in detail in the following sections.

3.1 Intensity-based segmentation method [38]

Pixel intensity is the primary source of information for an object in an image and is generally used for object segmentation. Therefore, this phase segments the objects of interest (leukocytes) from the complex background of inflamed mice skin sections using DE-based unsupervised multilevel clustering [38]. Let a given RGB image consist of $L$ intensity levels for each RGB plane. DE is used to calculate $n - 1$ thresholds $[t_1, t_2, \ldots, t_{n-1}]$ for each RGB plane by maximizing the following objective function.

φ = max1
