Int. J. Bioinformatics Research and Applications, Vol. 10, No. 6, 2014


Drawing inferences from clinical studies with missing values using genetic algorithm

R. Devi Priya* and S. Kuppuswami
Kongu Engineering College, Erode 638 052, Tamil Nadu, India
Email: [email protected]
Email: [email protected]
*Corresponding author

Abstract: The missing data problem degrades the statistical power of any analysis made in clinical studies. To infer valid results from such studies, a suitable method is required to replace the missing values. No method is universally applicable for handling missing values, and the main objective of this paper is to introduce a common method applicable in all cases of missing data. In this paper, Bayesian Genetic Algorithm (BGA) is proposed to effectively impute both continuous and discrete missing values using a heuristic search algorithm, the genetic algorithm, together with Bayes' rule. BGA is applied to impute missing values in a real cancer dataset under Missing At Random (MAR) and Missing Completely At Random (MCAR) conditions. For both discrete and continuous attributes, the results show better classification accuracy and RMSE% than many existing methods.

Keywords: missing values; BGA; Bayesian genetic algorithm; MAR; missing at random; MCAR; missing completely at random; continuous attributes; discrete attributes.

Reference to this paper should be made as follows: Devi Priya, R. and Kuppuswami, S. (2014) ‘Drawing inferences from clinical studies with missing values using genetic algorithm’, Int. J. Bioinformatics Research and Applications, Vol. 10, No. 6, pp.613–627.

Biographical notes: R. Devi Priya is an Assistant Professor in the Department of Information Technology at Kongu Engineering College. She completed her BE at Bharathiyar University and her ME and PhD at Anna University. She has published about ten papers in various international journals and conferences. Her research interests include data pre-processing, data mining and optimisation algorithms.

S. Kuppuswami is the Principal at Kongu Engineering College. He has about 35 years of teaching and research experience. He has published more than 60 papers in various international journals and conferences. His areas of interest include Software Engineering, Software Architecture, Agent Technology and Pervasive Computing.

Copyright © 2014 Inderscience Enterprises Ltd.

1 Introduction

The number of patients affected by various diseases has been rising for some decades. Clinical studies are often conducted on patients to examine the influence of selected variables and risk factors on their treatment outcomes. The results of such studies help clinicians to better understand the cause and history of a disease, to choose the appropriate treatment method and to predict the outcome of the chosen treatment accurately. All studies have their own design and measurement characteristics, which involve multiple variables from heterogeneous sources. Applying data mining and machine learning techniques to such multivariate data is an extremely challenging task (Cios and Moore, 2002). When the datasets contain detailed patient information and proper assumptions are made for the parameters, the missing covariate values can still be imputed efficiently (Clark and Altman, 2003; Nelwamondo and Marwala, 2007).

In medical research, the problem of missing data occurs frequently. Most statistical techniques need complete data for analysis, which is often not available. In Wood et al. (2004), 89% (63 out of 71) of the clinical trials reported missing data. In another study, 81 out of 100 articles in 7 cancer journals reported missing data. Parsons et al. (2011) found that 72.8% of 97,230 patients had missing values in one or more attributes. Values are noted to be missing often in healthier patients and in patients at a critical stage.

Dropouts are common in clinical studies and lead to missing data, which adversely affects the calculations and interpretations made (Shih, 2002). Patients drop out of studies at different stages. For example, some patients are available only at baseline and are missing for further assessments, while others drop out at intermediate stages. In particular, patients may not return for subsequent treatments because of death, because they take treatment in other hospitals, or because they have been cured through a good response to the treatment given. Even if they return, some vital elements of clinical importance cannot be obtained from them. The cost and effort spent on prognostic studies for such dropouts turn out to be useless. These kinds of dropouts cannot be completely avoided because every stage of the treatment can cost the lives of patients, and the only possibility for researchers is to make inferences from the data that are available to them.

Inferences made by considering only the complete subset of the dataset are not always valid. If the percentage of missingness is low, it does not have much impact on the interpretation of the results. But if the percentage of missingness is high, ignoring the patients with incomplete information reduces the number of valid cases used for analysis and thus adversely affects the statistical power. Various methods are available to treat missing values, but they differ in the way they analyse the available information and treat those missing values (Collins et al., 2002). The problem with existing methods is that the simple methods provide biased results and the advanced methods are often difficult for clinicians to implement. Hence we address this problem by introducing a method called Bayesian Genetic Algorithm (BGA), which maintains a balance between simplicity and efficiency. To illustrate the performance of the proposed algorithm, it is applied to a cancer dataset and the inferences from it are analysed. The results obtained from BGA are also compared with state-of-the-art techniques for handling missing values and are reported in this paper.

2 Background

2.1 Missing data mechanisms

Let D denote the dataset, which contains the attribute X consisting of \( (X_{obs}, X_{mis}) \), where \( X_{obs} \) denotes the observed variables and \( X_{mis} \) denotes the missing variables; M denotes the distribution of missing values. Little and Rubin (2002) classified missing data into three mechanisms based on the way in which the data are missing.

Missing at Random (MAR): The missingness depends only on the observed values, so the missing values can be estimated from the observed values in D.

\[ P(M \mid X) = P(M \mid X_{obs}) \]

Missing Completely at Random (MCAR): The missingness depends neither on the observed values nor on the values which are missing in D.

\[ P(M \mid X) = P(M) \]

Not Missing at Random (NMAR): The missingness has no dependency on the observed values but depends on the unobserved (missing) values.

\[ P(M \mid X) = P(M \mid X_{mis}) \]
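The three mechanisms above can be illustrated by generating deletion masks on synthetic data. The following is only a sketch under stated assumptions: the attributes `age` and `weight`, their distributions, and the deletion probabilities are hypothetical and are not taken from the study's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(55, 12, n)      # hypothetical covariate, always observed
weight = rng.normal(60, 10, n)   # hypothetical attribute that receives missing values

# MCAR: every entry is deleted with the same probability, independent of all data.
mcar_mask = rng.random(n) < 0.20

# MAR: the deletion probability depends only on the observed covariate (age).
mar_mask = rng.random(n) < np.where(age > 60, 0.35, 0.10)

# NMAR: the deletion probability depends on the value that goes missing (weight).
nmar_mask = rng.random(n) < np.where(weight > 65, 0.35, 0.10)

# Apply one of the masks: deleted entries become NaN.
weight_mcar = np.where(mcar_mask, np.nan, weight)
```

Under the MAR mask the missing rate differs between the two age groups, while under the MCAR mask it is the same everywhere up to sampling noise.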

MAR and MCAR are considered ignorable mechanisms, for which the missingness process need not be modelled, whereas NMAR is non-ignorable and requires the development of a model of the missingness, which is difficult to analyse (Rubin, 1977). Hence this paper concentrates on the MAR and MCAR conditions. Researchers have worked on this problem for a long time and have introduced many methods to impute missing values. Some of these methods are briefly described in National Research Council (2010).

2.2 Related works

Burton and Altman (2004) reported that 32 of the 81 articles with missing data that they surveyed treated the missing values with some technique. Among these, the most widely used method is available case analysis, where only the records that are completely available are used for analysis and the records with missing values are simply ignored. But ignoring the incomplete records leads to biased results, which is not acceptable when important life-risking decisions have to be taken from them (Janssen et al., 2010; Guan and Yusoff, 2011). It is reasonable to use this method only when the percentage of missingness is less than 5%, and it cannot be used for highly sensitive datasets.

Numerous methods are available to handle missing data; only some of the most commonly used ones are discussed in this paper. The mean of the available values can be substituted for the missing values if the attribute is numerical, and the mode can be used when a discrete attribute value is missing. But these simple methods do not yield valid results and hence are not generally recommended. Many researchers have suggested Multiple Imputation (MI), which uses all the observed data to substitute for the missing values (Little, 1999; Schafer and Graham, 2002; Ali et al., 2011). MI is becoming popular for finding the missing values in medical records (Harrell, 2001; Faris et al., 2002; Clark and Altman, 2003; Donders et al., 2006; Moons et al., 2006). But even today many clinical researchers are unaware of this method and, in spite of its high statistical power, only very few articles report the use of MI. The challenges faced while using MI are that (i) it requires a model to substitute the plausible values, (ii) discrete missing values cannot be handled efficiently, (iii) non-monotone variables are harder to impute and (iv) it requires much expertise and statistical knowledge from the user to completely understand the trend of the data.

Some other methods proposed by various researchers are the expectation-maximisation algorithm (Dempster et al., 1977), genetic algorithm and autoassociative neural networks (Abdella and Marwala, 2005), genetic algorithm and fuzzy rule sets (Chen and Huang, 2003), the rough sets approach (Nelwamondo and Marwala, 2007), fuzzy approaches (Gabrys, 2002), Principal Component Analysis (PCA) and autoassociative neural networks (Mistry et al., 2009), probability density (Wang, 2008), and genetic algorithm and multiple imputation (Patil and Bichkar, 2010).

There is no commonly applicable method for handling missing clinical data. Different methods work better on different datasets, and hence the analysts hold the prime responsibility for applying a suitable method to the problem.
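The mean and mode substitution mentioned in this section can be sketched in a few lines. This is an illustrative baseline only; the function names and the sample columns are hypothetical, and missing entries are assumed to be encoded as `None`.

```python
from collections import Counter
from statistics import mean

def impute_mean(values):
    """Replace None in a numeric column with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Replace None in a discrete column with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

weights = [52.0, None, 61.0, 58.0, None]   # hypothetical continuous attribute
gender = ["F", "M", None, "F", "F"]        # hypothetical discrete attribute

weights_filled = impute_mean(weights)      # both gaps receive 57.0
gender_filled = impute_mode(gender)        # the gap receives "F"
```

Every gap in a column receives the same value, which is why these methods ignore the covariates of the individual record.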

3 Bayesian genetic algorithm

Any method applied to impute missing values should satisfy three important requirements: (i) the method should yield unbiased estimates, (ii) an appropriate method should be present to assess the degree of uncertainty and (iii) it should have good statistical power (Graham, 2009). All these aspects are given importance in the Bayesian genetic algorithm. We propose to use a genetic algorithm combined with Bayes' rule to effectively calculate the values of missing data. The striking feature of the genetic algorithm is the richness of its computation process. The small and simple modifications made in every generation often gradually lead to the desired results. Even though it looks like a random method, the results produced over generations are not random but evolve towards the optimum. It is reported in the literature that, if a sufficient number of generations is allowed and prior knowledge is incorporated, the genetic algorithm results in an optimal solution. Moreover, applying a genetic algorithm does not require much statistical knowledge from the user.

3.1 Genetic algorithm

A genetic algorithm (GA) is the process of deriving an optimal solution from a pool of available solutions based on the ideas of natural evolution. It starts with defining the structure of the chromosomes, each of which is a collection of genes. The fitness values of all chromosomes are estimated using a fitness function, which defines the capability of each chromosome to produce better individuals for the next generation. The success of a GA lies in how well the fitness function is defined, because it holds the prime responsibility for identifying the best chromosomes to be carried over to the next generation. Based on the fitness values calculated, the best parents are selected using selection mechanisms such as roulette wheel selection, rank selection, etc. to replace the worst chromosomes.

Genetic operators such as crossover and mutation are then applied. Crossover is the process where genes in both parent chromosomes are exchanged and new combinations of offspring are produced with a predefined Crossover Probability (Pc). There are many crossover mechanisms, such as one-point, two-point and uniform crossover. Mutation changes the values of genes in the chromosomes with a very low probability in order to prevent the concentration of chromosomes in a single region of the solution space and thereby avoid premature convergence. It is a process where some genes (attribute values) in the chromosomes are altered randomly. The structure of the genetic algorithm is given in Figure 1.

Figure 1  Structure of genetic algorithm (flowchart with the stages: encode chromosome structure, population, fitness estimation, terminate? (Y: return the solution; N: continue), parent selection, crossover, mutation)
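The operators described in this section can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes list-valued chromosomes, rank selection, one-point crossover and full generational replacement, and all function names are hypothetical. The default probabilities 0.8 and 0.03 are the values reported later in Section 4.

```python
import random

def rank_select(population, fitness, rng):
    """Rank selection: the selection weight of a chromosome is its fitness rank."""
    ranked = sorted(population, key=fitness)        # worst first, best last
    weights = list(range(1, len(ranked) + 1))       # rank 1 = worst
    return rng.choices(ranked, weights=weights, k=1)[0]

def one_point_crossover(a, b, rng, pc=0.8):
    """With probability pc, swap the tails of two parents at a random cut point."""
    if len(a) > 1 and rng.random() < pc:
        cut = rng.randrange(1, len(a))
        return a[:cut] + b[cut:], b[:cut] + a[cut:]
    return a[:], b[:]

def mutate(chrom, gene_values, rng, pm=0.03):
    """Replace each gene, with probability pm, by a random value from its domain."""
    return [rng.choice(gene_values[i]) if rng.random() < pm else g
            for i, g in enumerate(chrom)]

def evolve(population, fitness, gene_values, generations=50, seed=0):
    """Run the select / crossover / mutate loop and return the fittest chromosome."""
    rng = random.Random(seed)
    for _ in range(generations):
        next_gen = []
        while len(next_gen) < len(population):
            p1 = rank_select(population, fitness, rng)
            p2 = rank_select(population, fitness, rng)
            c1, c2 = one_point_crossover(p1, p2, rng)
            next_gen += [mutate(c1, gene_values, rng), mutate(c2, gene_values, rng)]
        population = next_gen[:len(population)]
    return max(population, key=fitness)
```

A fixed generation count stands in for the "Terminate?" decision of Figure 1; other stopping criteria could be substituted without changing the operators.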

3.2 Bayes theorem

Bayes' theorem is a simple result based on the concept of conditional probability and has been used effectively in imputing missing values (Boone, 2003). It estimates the probability by counting the occurrences of the combinations of values which are of interest to the user. It holds for both discrete and continuous values, and a suitable formula can be selected when trying to impute the missing attribute values. Let Y be any effect and X the given condition. Bayes' rule states that

\[ P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)} \]

where P(Y|X) is the posterior (conditional) probability and P(Y) is the prior probability. In the proposed approach, Bayes' rule is used to calculate the fitness values of the chromosomes for both MAR and MCAR. Under the MAR assumption, the attributes under study are correlated, and if the value of an attribute is reported to be missing, it can be inferred by using the correlated variables. Under the MCAR assumption, even though the values are missing completely at random, the missing values cannot be calculated without considering the distribution of the other variables. Hence all significant attributes are included in the chromosome structure and the unnecessary attributes are discarded.

The importance of Bayes' rule in probabilistic models lies in its simple form, which is capable of producing better solutions. Bayes' rule has already been used extensively in imputing missing values in many clinical studies (Andersen, 2007). There are many advantages of using the Bayesian approach for missing values. Mostly, all possible values of the missing observations are utilised for analysis rather than a single point, thereby preventing the GA from getting stuck in a local optimum. It is also easy to include restrictions on the Bayesian parameters.
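The counting estimate mentioned in this section can be sketched for discrete values as follows. The function name and the six toy records are hypothetical and serve only to show how the prior, the likelihood and the evidence are obtained from frequencies.

```python
from collections import Counter

def posterior(records, x, y):
    """Estimate P(Y = y | X = x) by counting, via Bayes' rule.

    records is a list of (x_value, y_value) pairs.
    """
    n = len(records)
    count_x = Counter(r[0] for r in records)
    count_y = Counter(r[1] for r in records)
    count_xy = Counter(records)
    if count_x[x] == 0 or count_y[y] == 0:
        return 0.0
    prior = count_y[y] / n                        # P(Y)
    likelihood = count_xy[(x, y)] / count_y[y]    # P(X | Y)
    evidence = count_x[x] / n                     # P(X)
    return likelihood * prior / evidence

# Hypothetical records: (smoker status, blood pressure)
records = [("smoker", "high"), ("smoker", "high"), ("smoker", "low"),
           ("non", "low"), ("non", "low"), ("non", "high")]
p_high_given_smoker = posterior(records, "smoker", "high")   # 2 of the 3 smokers
```

For count-based estimates the result equals the direct conditional frequency, so the value of Bayes' rule here is the decomposition into prior and likelihood terms, which carries over to the continuous cases.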


3.3 Fitness calculation

Let X denote the missing attribute which is to be estimated and Y denote the covariate vector with attributes {l, m} which can be used to infer the values of X. Any number of covariates can be used in the vector Y, but for simplicity only two covariates l and m are assumed in the following definitions. The missing variable and its covariates can each be continuous or discrete, or a combination of both types. The fitness functions used in BGA for six such combinations of attributes are given below.

Definition 1: Missing value X and all covariates Y (l, m) are discrete. The missing discrete attribute X may depend on other discrete attributes l and m, which are given by the vector Y{l, m}. P(X = x | Y = l, m) represents the probability of X having the value x when the attributes l and m in Y take their respective discrete class values. In such cases, the formula to be used is given in Equation (1).

\[ P(X = x \mid Y = l, m) = \frac{P(Y = l, m \mid X = x)\, P(X = x)}{P(Y = l, m)} \tag{1} \]

where P(·) denotes a probability. For example, suppose a dataset contains the discrete attributes age {young, middle-aged, old}, smoker {0, 1} and blood pressure {low, average, high}. A sample chromosome is given in Figure 2.

Figure 2  Sample chromosome

  Age      Smoker   Blood pressure
  Young    0        average

Suppose the age value is missing for a record whose covariate vector {l, m} has the values 0 for smoker (l) and average for blood pressure (m). The population for BGA is then initialised with chromosomes having smoker = 0 and blood pressure = average, combined with all age values in the dataset. The chromosome which has the highest fitness value is selected as the solution and the corresponding value of the attribute age is substituted for the missing value.

Definition 2: Missing value X is discrete and the covariates Y have mixed type (l is continuous and m is discrete). The missing discrete attribute X may sometimes depend on both continuous and discrete attributes. This case uses Equation (2) to calculate the fitness function.

\[ P(X = x \mid Y_{lm}) = \frac{f_{Y_{lm}}(Y \mid X = x)\, P(X = x)}{f_{Y_{lm}}(Y)} \tag{2} \]

where

\[ f_{Y_{lm}}(Y \mid X = x) = \sum_{m} f_{X=x}(Y_l) \quad \text{and} \quad f_{Y_{lm}}(Y) = \sum_{m} f_{Y}(Y_l) \]

Here f(·) denotes a Probability Density Function (pdf). If continuous variables are involved in the imputation process, they can be modelled with a normal distribution. The form of the normal distribution depends on the number of continuous attributes: a univariate normal distribution can be used if one continuous attribute is involved in the analysis, and bivariate and multivariate distributions can be used for two and for more attributes, respectively. In this case, l is a continuous attribute and hence the pdf is calculated wherever it is used. Since m is discrete, a summation is calculated. In \( f_{Y_{lm}}(Y \mid X = x) = \sum_{m} f_{X=x}(Y_l) \), the pdf is first applied for l and the posterior probability is estimated from its results. For \( f_{Y_{lm}}(Y) = \sum_{m} f_{Y}(Y_l) \), the prior probability is used instead.

Definition 3: Missing value X is discrete and all covariates Y (l, m) are continuous. The missing discrete attribute may depend on other continuous attributes. This case uses Equation (3) to calculate the fitness function.

\[ P(X = x \mid Y_{lm}) = \frac{f_{Y_{lm}}(Y \mid X = x)\, P(X = x)}{f_{Y_{lm}}(Y)} \tag{3} \]

where

\[ f_{Y_{lm}}(Y \mid X = x) = f_{X=x}(Y_{lm}) \quad \text{and} \quad f_{Y_{lm}}(y) = f_{Y}(Y_{lm}) \]

Since both l and m are continuous, the pdf is applied to both of them with the value of attribute X taking only x. Estimating \( f_{Y_{lm}}(y) \) simply amounts to estimating the pdf using l and m.

Definition 4: Missing value X is continuous and all covariates Y (l, m) are continuous. In some cases, both the missing value and all the correlated variables are continuous. Equation (4) holds for this case.

\[ f(X \mid Y_{lm}) = \frac{f_{Y_{lm}}(Y \mid X = x)\, f(X = x)}{f_{Y_{lm}}(Y)} \tag{4} \]

where both numerator terms \( f_{Y_{lm}}(Y \mid X = x) \) and \( f(X = x) \) take the bivariate normal distribution given by \( f_{X=x}(Y_{l,m}) \) and \( f(Y_{lm}) \), respectively.
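One way to evaluate Equation (4) numerically is to note that the numerator f(Y | X = x) f(X = x) is the joint density of (X, Y), so the fitness is the joint density divided by the marginal density of Y. The sketch below assumes that X and the covariates are jointly multivariate normal with mean and covariance estimated from the complete records; this modelling choice and the function names are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def mvn_pdf(v, mean, cov):
    """Density of a multivariate normal N(mean, cov) evaluated at v."""
    v = np.atleast_1d(v).astype(float)
    mean = np.atleast_1d(mean).astype(float)
    cov = np.atleast_2d(cov).astype(float)
    diff = v - mean
    norm = np.sqrt((2 * np.pi) ** diff.size * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def fitness_continuous(x, y, data):
    """Equation (4): f(X = x | Y = y) = f(Y | X = x) f(X = x) / f(Y).

    data holds the complete records; column 0 is X, the remaining
    columns are the continuous covariates (l, m).
    """
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    joint = mvn_pdf(np.concatenate(([x], y)), mean, cov)   # numerator of Eq. (4)
    marginal = mvn_pdf(y, mean[1:], cov[1:, 1:])           # denominator f(Y)
    return joint / marginal

# Hypothetical complete records with three correlated continuous attributes.
rng = np.random.default_rng(1)
data = rng.multivariate_normal(
    [60, 50, 25], [[100, 40, 20], [40, 90, 10], [20, 10, 16]], size=500)
y_obs = np.array([55.0, 26.0])
score = fitness_continuous(62.0, y_obs, data)
```

Candidate values of X close to the conditional mean given the observed covariates receive the highest fitness, which is the behaviour a GA needs to rank chromosomes.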

Definition 5: Missing value X is continuous and the covariates Y have mixed type (l is continuous, m is discrete). In some databases, the missing value may be continuous but may depend on both continuous and discrete attributes. The covariate vector is similar to the one described in Definition 2, with some minor differences. This case is prevalent in real datasets, and the fitness values are calculated based on Equation (5).

\[ f(X \mid Y_{lm}) = \frac{f_{Y_{lm}}(Y \mid X = x)\, f(X = x)}{f_{Y_{lm}}(Y)} \tag{5} \]

where

\[ f_{Y_{lm}}(Y \mid X = x) = \sum_{m} f_{X=x}(Y_l) \quad \text{and} \quad f_{Y_{lm}}(Y) = \sum_{m} f_{Y}(Y_l) \]

The pdf is estimated for attribute l and the summation is then applied to the result with respect to attribute m.

Definition 6: Missing value X is continuous and all covariates Y (l, m) are discrete. Equation (6) is used to calculate the fitness values if the missing continuous attribute value depends on discrete covariates. Here the summation is applied over both l and m, since both are discrete, using the pdf values of X = x.

\[ f(X \mid Y_{lm}) = \frac{f_{Y_{lm}}(Y \mid X = x)\, f(X = x)}{P(Y_{lm})} \tag{6} \]

where

\[ f_{Y_{lm}}(Y \mid X = x) = \sum_{Y_l} \sum_{Y_m} f_{X=x}(X) \]

The general outline of the proposed Bayesian genetic algorithm is given below.

FOR each MVi in D
    Cnt = 1;  // Number of iterations
    REPEAT
        Step 1: Initialise the population Pi with j chromosomes  // j = population size
        Step 2: Calculate the fitness values of all chromosomes in the population Pi based on Bayesian values
            FOR each chromosome Cij in Pi
                F(Cij) is obtained from Equation (1)  // If the missing value and the covariate attributes are all discrete
                F(Cij) is obtained from Equation (2)  // If the missing value is discrete and the covariate attributes are both discrete and continuous
                F(Cij) is obtained from Equation (3)  // If the missing value is discrete and the covariate attributes are all continuous
                F(Cij) is obtained from Equation (4)  // If the missing value is continuous and the covariate attributes are all continuous
                F(Cij) is obtained from Equation (5)  // If the missing value is continuous and the covariate attributes are both continuous and discrete
                F(Cij) is obtained from Equation (6)  // If the missing value is continuous and the covariate attributes are all discrete
            END FOR
        Step 3: Select the best parents (chromosomes) based on the fitness values
        Step 4: Perform crossover to generate new offspring with probability (Pc)
        Step 5: Perform mutation with probability (Pm)
        Cnt++;
    UNTIL Cnt = S  // S = stopping criterion is reached
END FOR

The significant property to be noted in all the formulae used here is that if an attribute is discrete, the probabilities can be taken directly, and if it is continuous, the probability density function is used. If probability values are considered, the denominator takes the summation over all class values, and if the density function is used, the integral formula is applied. The advantage is that the application of BGA is very simple and does not need much statistical knowledge.
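The outline above can be sketched end to end for the all-discrete case of Definition 1. This is only an illustration under several assumptions: the fitness is estimated by counting over complete records, the population is initialised once rather than on every iteration, only the missing gene is mutated so that the observed covariates stay fixed, and the toy records and all names are hypothetical. With a single missing gene the search space is small, so the sketch mainly shows how the steps fit together.

```python
import random

def bayes_fitness(chrom, records, miss_idx):
    """Equation (1): P(X = x | Y) = P(Y | X = x) P(X = x) / P(Y), by counting."""
    x = chrom[miss_idx]
    y = tuple(g for i, g in enumerate(chrom) if i != miss_idx)
    n = len(records)
    n_x = sum(1 for r in records if r[miss_idx] == x)
    n_y = sum(1 for r in records
              if tuple(g for i, g in enumerate(r) if i != miss_idx) == y)
    n_xy = sum(1 for r in records if tuple(r) == tuple(chrom))
    if n_x == 0 or n_y == 0:
        return 0.0
    return (n_xy / n_x) * (n_x / n) / (n_y / n)

def bga_impute(record, miss_idx, records, domain,
               pop_size=50, pc=0.8, pm=0.03, generations=20, seed=0):
    """Impute record[miss_idx] with the value carried by the fittest chromosome."""
    rng = random.Random(seed)

    def fit(chrom):
        return bayes_fitness(chrom, records, miss_idx)

    # Step 1: observed covariates are fixed, the missing gene is random.
    pop = []
    for _ in range(pop_size):
        chrom = list(record)
        chrom[miss_idx] = rng.choice(domain)
        pop.append(chrom)

    ranks = list(range(1, pop_size + 1))
    for _ in range(generations):
        ranked = sorted(pop, key=fit)                 # Step 2: fitness, worst first
        next_gen = []
        while len(next_gen) < pop_size:
            a, b = rng.choices(ranked, weights=ranks, k=2)   # Step 3: rank selection
            if rng.random() < pc:                            # Step 4: one-point crossover
                cut = rng.randrange(1, len(a))
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):                             # Step 5: mutation
                child = list(child)
                if rng.random() < pm:
                    child[miss_idx] = rng.choice(domain)
                next_gen.append(child)
        pop = next_gen[:pop_size]
    return max(pop, key=fit)[miss_idx]

# Hypothetical complete records: (age, smoker, blood pressure)
records = ([("young", 0, "average")] * 6 + [("old", 0, "average")] * 2 +
           [("middle-aged", 0, "average")] + [("old", 1, "high")] * 5 +
           [("young", 1, "low")] * 2)
imputed_age = bga_impute((None, 0, "average"), 0, records,
                         ["young", "middle-aged", "old"])
```

In this toy data, six of the nine records with smoker = 0 and blood pressure = average are young, so chromosomes carrying "young" have the highest fitness (6/9) and dominate the population after a few generations.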

4 Implementation

BGA can be applied to all types of datasets with missing values. Here BGA is implemented for the real dataset collected from 608 patients admitted in a cancer hospital for treatment in Erode district of Tamil Nadu. The dataset contains many attributes but we consider only the following attributes in Table 1 which are more relevant to our study.

Table 1  Attributes and their types

  S. No.   Attribute     Type of attribute
  1        Age           Continuous
  2        City          Discrete
  3        Gender        Discrete
  4        Cancer Type   Discrete
  5        RT/CT         Discrete
  6        Recurrence    Discrete
  7        Weight        Continuous

When complete information is available for all the patients, any inference made from it may be valid. But in cancer research, dropouts are common and almost unavoidable. Without knowing the effects of the treatment given to the dropouts, the results of such studies will be incomplete and inaccurate. Hence BGA can be applied to impute these missing values. It is very difficult to prove the efficiency of any imputation method without knowing the original values. So, we took some complete records and deleted entries randomly to check the performance of BGA under all kinds of distribution. The values of the deleted entries were then calculated using BGA and the accuracy of prediction was evaluated against the original values.

The performance measures used to test the accuracy of imputation for continuous attributes are the correlation coefficient (r) and the Root Mean Square Error (RMSE) (Zhu et al., 2011). RMSE gives the average magnitude of the error by calculating the differences between the estimated and original values. The degree of relationship between the actual values and the estimated values is given by the correlation coefficient (r). For discrete attributes, the classification error percentage is calculated.

Different population sizes, selection mechanisms, crossover mechanisms, and crossover and mutation probabilities were tried. The final results varied only slightly across these settings. Hence our experiments were conducted with the following best genetic parameters identified.

  Population size        : 50
  Selection              : Rank selection
  Crossover              : One point crossover
  Crossover probability  : 0.80
  Mutation probability   : 0.03
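The three performance measures named above can be computed as in the following sketch. The function names and the sample values are hypothetical, and the RMSE is shown in its plain form because the normalisation behind the reported RMSE% is not specified in the text.

```python
import math

def rmse(actual, imputed):
    """Root mean square error between the original and the imputed values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(actual, imputed)) / len(actual))

def pearson_r(actual, imputed):
    """Pearson correlation coefficient between original and imputed values."""
    n = len(actual)
    mean_a, mean_b = sum(actual) / n, sum(imputed) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(actual, imputed))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_b = sum((b - mean_b) ** 2 for b in imputed)
    return cov / math.sqrt(var_a * var_b)

def classification_error_pct(actual, imputed):
    """Percentage of discrete entries whose imputed class differs from the original."""
    wrong = sum(1 for a, b in zip(actual, imputed) if a != b)
    return 100.0 * wrong / len(actual)

# Hypothetical deleted entries and their imputed replacements.
err_cont = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
r_cont = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
err_disc = classification_error_pct(["a", "b", "c", "d"], ["a", "b", "x", "d"])
```

Each function compares only the entries that were deliberately deleted, so the original values are known and the imputation can be scored directly.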

5 Results and discussion

In the dataset containing 608 records, the experiments were conducted with different missing rates ranging from 5% to 60% for both MAR and MCAR conditions. The dataset was analysed in three aspects.

5.1 Case 1

Depending upon the health conditions of the patients and the cancer incidence, the treatments given may vary. Patients may be treated with radiotherapy, chemotherapy or chemoradiotherapy. The recurrence pattern of the disease after treatment has to be studied, which may help clinicians to choose the appropriate treatment method. Such prognostic studies are important for improving the treatment method and thereby the cancer survival rate. Given the details of the patients and their treatments (RT, CT, RT+CT), the objective is to study the recurrence pattern of cancer after treatment. Due to dropouts, missing values in the recurrence attribute are prevalent. They can be filled with the help of correlated variables such as age, gender, cancer type and treatment type. The chromosome for this case looks like the one given in Figure 3.

Figure 3  Sample chromosome for case 1

  Recurrence | Age | Gender | Cancer type | Treatment type

BGA and other existing methods are employed to effectively impute the values and the results for MAR and MCAR are given below in Table 2.

Table 2  Classification accuracy of BGA and other methods under MAR and MCAR conditions in imputing recurrence pattern (discrete attribute) at different missing rates

  Missing            MAR                        MCAR
  rate       Mode     MI       BGA      Mode     MI       BGA
  5%         6.45     2.30     1.78     6.89     2.87     1.64
  10%        9.63     4.46     2.45     9.71     4.69     2.57
  15%        14.78    6.85     3.27     13.57    7.05     2.98
  20%        19.25    8.60     4.26     20.35    8.04     4.52
  30%        23.14    11.52    6.34     24.78    12.69    6.89
  40%        26.07    17.74    8.70     27.69    18.13    9.46
  50%        31.62    19.36    9.45     31.47    20.24    10.24
  60%        39.74    22.16    10.36    40.88    23.56    11.53

Case 1 results fill out the recurrence pattern of the missing values. BGA tends to produce better results than mode imputation and MI. For 5% missing data under MAR, BGA showed an RMSE of only 1.78% compared with 2.30% for MI. Most statisticians have demonstrated in the literature that MI works better than other methods. But Table 2 clearly shows that BGA tends to produce much better results than MI even with missing rates as high as 60%. Mode imputation produced less accurate estimates than both MI and BGA for all the missing rates. For MCAR, BGA again shows the best performance of the three methods: it produces an RMSE of 1.64% for 5% missingness and of only 11.53% for 60% missingness. The results obtained from BGA are consistent with many studies which report that patients treated with radiotherapy combined with chemotherapy have less recurrence than patients treated with chemotherapy or radiotherapy alone.
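The comparison between BGA and MI can be made concrete by taking the ratio of their reported values at each missing rate. The numbers below are copied from Table 2 and treated as errors (lower is better), as the discussion does; the helper name is hypothetical and the script is only an arithmetic aid.

```python
rates = ["5%", "10%", "15%", "20%", "30%", "40%", "50%", "60%"]

# Values copied from Table 2 (recurrence attribute).
mar = {
    "Mode": [6.45, 9.63, 14.78, 19.25, 23.14, 26.07, 31.62, 39.74],
    "MI":   [2.30, 4.46, 6.85, 8.60, 11.52, 17.74, 19.36, 22.16],
    "BGA":  [1.78, 2.45, 3.27, 4.26, 6.34, 8.70, 9.45, 10.36],
}
mcar = {
    "Mode": [6.89, 9.71, 13.57, 20.35, 24.78, 27.69, 31.47, 40.88],
    "MI":   [2.87, 4.69, 7.05, 8.04, 12.69, 18.13, 20.24, 23.56],
    "BGA":  [1.64, 2.57, 2.98, 4.52, 6.89, 9.46, 10.24, 11.53],
}

def bga_to_mi_ratio(table):
    """Ratio of the BGA value to the MI value at each missing rate."""
    return [round(b / m, 2) for b, m in zip(table["BGA"], table["MI"])]

for name, table in (("MAR", mar), ("MCAR", mcar)):
    for rate, ratio in zip(rates, bga_to_mi_ratio(table)):
        print(f"{name} {rate:>3}: BGA/MI = {ratio:.2f}")
```

Under MAR the ratio is 0.77 at 5% missingness and 0.47 at 60%, so the gap between BGA and MI widens in relative terms as the missing rate grows.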

5.2 Case 2

The objective here is to estimate the missing values of cancer type, provided that the values of age, gender and city are available. The chromosome includes the missing variable and its covariates, as shown in Figure 4.

Figure 4  Sample chromosome for case 2

  Cancer type | Age | Gender | City

This kind of analysis will greatly help clinicians in determining the group of people living in specific area and age who are more prone to a particular type of cancer. The classification accuracy rate of BGA, mode imputation and MI are given in Table 3 for both MAR and MCAR.

Table 3  Classification accuracy of BGA and other methods under MAR and MCAR condition in imputing cancer type (discrete attribute) at different missing rates

  Missing            MAR                        MCAR
  rate       Mode     MI       BGA      Mode     MI       BGA
  5%         6.45     2.30     1.78     6.89     2.87     1.64
  10%        9.63     4.46     2.45     9.71     4.69     2.57
  15%        14.78    6.85     3.27     13.57    7.05     2.98
  20%        19.25    8.60     4.26     20.35    8.04     4.52
  30%        23.14    11.52    6.34     24.78    12.69    6.89
  40%        26.07    17.74    8.70     27.69    18.13    9.46
  50%        31.62    19.36    9.45     31.47    20.24    10.24
  60%        39.74    22.16    10.36    40.88    23.56    11.53

BGA fills missing values of cancer type by making use of other attribute values. There are 24 types of cancer involved in the dataset. The results showed that BGA achieves higher classification accuracy even for attributes with a large number of classes (24), such as cancer type. BGA showed comparatively much better performance than the other missing data handling methods for both MAR and MCAR types of missingness. Even with 60% of missingness, for both MAR and MCAR, BGA resulted in
