Seminal quality prediction using data mining methods.

Technology and Health Care 22 (2014) 531–545 DOI 10.3233/THC-140816 IOS Press


Anoop J. Sahoo (a) and Yugal Kumar (b,∗)

(a) System Engineer, Infosys Technologies, Chennai, India
(b) Department of Information Technology, Birla Institute of Technology, Mesra, Ranchi, Jharkhand, India

Received 7 November 2013
Accepted 24 March 2014

Abstract.
BACKGROUND: A new class of diseases, known as lifestyle diseases, has emerged in recent years. These diseases are driven mainly by changes in lifestyle such as alcohol consumption, smoking and food habits. Among the effects attributed to lifestyle, the fertility rate (sperm quantity) in men has decreased considerably over the last two decades. Lifestyle factors as well as environmental factors are mainly responsible for this change in semen quality.
OBJECTIVE: The objective of this paper is to identify the lifestyle and environmental features that affect seminal quality and the fertility rate in men using data mining methods.
METHOD: Five artificial intelligence techniques, namely multilayer perceptron (MLP), decision tree (DT), Naive Bayes (Kernel), support vector machine plus particle swarm optimization (SVM + PSO) and support vector machine (SVM), are applied to a fertility dataset to evaluate seminal quality and to predict whether a person has a normal or an altered fertility rate. Eight feature selection techniques, namely support vector machine (SVM), neural network (NN), evolutionary logistic regression (LR), SVM + PSO, principal component analysis (PCA), chi-square test, correlation and T-test, are used to identify the features that are most relevant to seminal quality. The fertility dataset contains 100 instances with nine attributes and two classes.
RESULTS: The experimental results show that SVM + PSO provides the highest accuracy and area under curve (AUC) (94% and 0.932) compared with MLP (92% and 0.728), SVM (91% and 0.758), Naive Bayes (Kernel) (89% and 0.850) and decision tree (89% and 0.735) for some of the seminal parameters. This paper also addresses the feature selection process, i.e. how to select the features that are most important for predicting the fertility rate. Eight feature selection methods are applied to the fertility dataset to find a set of good features. The results show that the childish diseases (0.079) and high fever (0.057) features have little impact on the fertility rate, while the age (0.8685), season (0.843), surgical intervention (0.7683), alcohol consumption (0.5992), accident (0.5937), smoking habit (0.575) and number of hours spent sitting (0.4366) features have more impact. Feature selection also increases the accuracy of the above techniques (multilayer perceptron 92%, support vector machine 91%, SVM + PSO 94%, Naive Bayes (Kernel) 89% and decision tree 89%) compared with the same techniques without feature selection (multilayer perceptron 86%, support vector machine 86%, SVM + PSO 85%, Naive Bayes (Kernel) 83% and decision tree 84%), which shows the applicability of feature selection methods in prediction.
CONCLUSION: This paper highlights the application of artificial intelligence techniques in the medical domain. It can be concluded that data mining methods can be used to predict whether a person has a disease based on environmental and lifestyle parameters rather than on various medical tests. Five data mining techniques are used to predict the fertility rate, among which SVM + PSO provides more accurate results than support vector machine and decision tree.

Keywords: Particle swarm optimization, multilayer perceptron, seminal, support vector machine

∗ Corresponding author: Yugal Kumar, Department of Information Technology, Birla Institute of Technology, Mesra, Ranchi, Jharkhand, India. E-mail: [email protected].

© 2014 – IOS Press and the authors. All rights reserved 0928-7329/14/$27.50


A.J. Sahoo and Y. Kumar / Seminal quality prediction using data mining methods

1. Introduction

Fertility rates have declined notably in the last two decades [1–3]. This decline has been attributed to changes in behavior related to economic aspects, the incorporation of women into the labour force and the ensuing delay in the age at which a person decides to have offspring, and the pervasive use of contraceptives [4,5]. Although social aspects clearly contribute significantly to the global decline in fertility rate, the deterioration of reproductive health caused by adverse biological factors is also involved [6]. In the past decades, Elisabeth Carlsen [7] performed a meta-analysis on the possibility of a decline in seminal quality. Several studies show a decrease in the semen parameters of men [8,9], while other studies found no evidence of that decline [10,11]. A probable decline in semen quality is attributed to several factors such as the occurrence of male reproductive diseases [12,13], environmental or occupational factors [14,15] and certain lifestyles [16,17]. Semen analysis is the keystone of the male study [18]. Semen analysis alone cannot determine whether a male can have offspring, but it is a good predictor of male fertility potential [19]. Semen analysis is also necessary to evaluate candidates to become semen donors [20].

In the last two decades, the use of AI has also become widely accepted in medical applications. Many of these applications have led to Expert Systems and Decision Support Systems in several different areas. To identify hypertension patients and find the crucial parameters for predicting whether a person is affected, Chao-Ton Su et al. [21] used support vector machine classifiers for feature selection and evaluated the SVM classifier in terms of accuracy, sensitivity and specificity; SVM exhibited better performance than a back propagation neural network. Yadav et al. [22] applied decision stump, logistic regression and sequential minimal optimization classifiers to predict people affected by Parkinson's disease and evaluated these classifiers in terms of accuracy, sensitivity and specificity; the logistic regression model provided the best results. Murat Karabatak et al. [23] proposed a new model based on association rules and a neural network to diagnose breast cancer. The model was evaluated with a 3-fold method and compared with a plain neural network model; the association rule plus neural network model gave the better classification rate. Orhan Er et al. [24] developed two multilayer neural network models for the diagnosis of tuberculosis, one with a single hidden layer and the other with two hidden layers. Measured on accuracy, the multilayer neural network with two hidden layers performed better. Kumar and Sahoo [25] proposed a rule-based classification model for predicting different types of liver disease. The model combines rules with different data mining techniques: the rules are derived from standard test result values and the data mining techniques are used to assess the performance of the model. The decision tree method was found to provide better results than rule induction, support vector machine, artificial neural network and naive Bayes. Gil et al. [26] applied decision trees, multilayer perceptron and support vector machines to predict seminal quality from data on environmental factors and lifestyle. Their results show that multilayer perceptron and support vector machine achieve the highest prediction accuracy, 86%, for some of the seminal parameters.

This paper studies the male fertility rate and applies artificial intelligence techniques to predict it and to find the environmental factors and life habits that affect semen quality.



2. Feature selection approaches

Features can be defined as the attributes, properties, variables or characteristics of a dataset. Feature selection (also known as variable selection) is an important research area for applications in which datasets with tens or hundreds of thousands of variables are available. Feature selection problems are found in many machine learning tasks including classification, regression and time series prediction. Feature selection identifies a subset of a given set of features such that the effectiveness and domain interpretability of an inference model are enhanced. Generally, a weight is assigned to each feature, and the value of the weight is used to decide whether the feature is crucial. Liu and Motoda [27] stated that the effects of feature selection are
– To improve performance (speed of learning, predictive accuracy, or simplicity of rules)
– To visualize the data for model selection
– To reduce dimensionality and remove noise

Feature selection algorithms can be divided into three categories: filters, wrappers, and embedded methods [28,29]. Both wrappers and filters have had some success with induction tasks, but they can be computationally very expensive when the number of variables is large. Although the embedded method has a lower computational cost than the other two, exhaustive search does not cope well with a large number of features. Simply stated, all three methods may waste computation when the number of variables is too large. In this paper, eight different feature selection methods are used to find an appropriate set of attributes that predicts the male fertility rate more accurately. The feature selection methods are chi-square, SVM, neural network, evolutionary logistic regression (LR), SVM + PSO, correlation, PCA and T-test. Each method assigns a weight value to every attribute of the fertility dataset. The weight values describe the importance of the attributes, and the attributes are ranked according to their weight values. Table 1 shows the weight value associated with each feature of the fertility dataset, which is used to find the more relevant features. The features are ranked on the basis of the weight values obtained from the feature selection methods. For example, in the chi-square test the age attribute obtains the maximum value, so its rank is 1, and childish diseases obtains the minimum value, so its rank is 9.

3. Data mining techniques

3.1. Decision trees

A decision tree (DT) is a classifier that forms a tree structure (see Fig. 1) in which each node is either a leaf node, which indicates the value of the target attribute (class), or a decision node, which specifies a test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test. Decision trees are powerful and popular tools for classification and prediction. In contrast to other AI methods, the advantage of decision trees is that they represent rules. Rules can be expressed so that everyone can understand them, or they can be used directly in a database. In some application areas the main concern is the accuracy of a classification or prediction technique, and how the model works does not matter. In other situations, the ability to explain the reason for a decision is crucial. For instance, when predicting male fertility potential, the model should describe the factors involved so that others can use this knowledge for successful prediction. Domain experts must therefore identify or model the crucial parameters and



Table 1 Weights calculated for each attribute of the fertility dataset using different feature selection methods, with the rank of each attribute according to its weight value. Each cell gives weight (rank).

Attribute                      SVM        NN         Evol. LR   SVM + PSO  Correlation  PCA        T-test     Chi-square  Avg. weight
Age                            0.365 (4)  0.039 (2)  3.037 (2)  1.698 (1)  0.555 (4)    0 (9)      0.254 (4)  1 (1)       0.8685 (1)
Childish diseases              0.107 (8)  0.007 (8)  0.316 (8)  0.085 (9)  0.104 (9)    0.019 (8)  0 (8)      0 (9)       0.07975 (8)
Smoking habit                  0.332 (5)  0.011 (6)  1.583 (6)  1.069 (3)  0.138 (7)    0.738 (2)  0.65 (2)   0.079 (6)   0.575 (6)
Surgical intervention          1 (1)      0.012 (5)  3.179 (1)  0.853 (6)  0.156 (6)    0.297 (3)  0.592 (3)  0.058 (7)   0.7683 (3)
High fevers                    0 (9)      0.004 (9)  0.13 (9)   0.106 (8)  0.112 (8)    0.107 (6)  0 (8)      0.003 (8)   0.05775 (9)
Accident or serious trauma     0.438 (3)  0.022 (3)  2.165 (4)  1.009 (5)  0.712 (3)    0.138 (5)  0.161 (5)  0.105 (5)   0.5937 (5)
Season                         0.255 (6)  0.009 (7)  2.846 (3)  1.367 (2)  1 (1)        1 (2)      0.059 (7)  0.208 (4)   0.843 (2)
Alcohol consumption            0.566 (2)  0.065 (1)  1.299 (6)  1.68 (4)   0.733 (2)    0.079 (7)  0.151 (6)  0.221 (3)   0.5992 (4)
Number of hours spent sitting  0.218 (7)  0.017 (4)  0.857 (7)  0.707 (7)  0.188 (5)    0.139 (4)  0.821 (1)  0.546 (2)   0.4366 (7)
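The "Avg. weight" column of Table 1 is consistent with a plain arithmetic mean of the eight per-method weights, with no rescaling across methods. The text does not state this, so the following is a minimal sketch under that assumption. The helper names `average_weights` and `rank_features` are hypothetical, and the numbers are copied from Table 1.

```python
# Per-method weights copied from Table 1, in the order:
# SVM, NN, Evolutionary LR, SVM + PSO, Correlation, PCA, T-test, Chi-square.
weights = {
    "Age": [0.365, 0.039, 3.037, 1.698, 0.555, 0.0, 0.254, 1.0],
    "Childish diseases": [0.107, 0.007, 0.316, 0.085, 0.104, 0.019, 0.0, 0.0],
    "Smoking habit": [0.332, 0.011, 1.583, 1.069, 0.138, 0.738, 0.65, 0.079],
    "Surgical intervention": [1.0, 0.012, 3.179, 0.853, 0.156, 0.297, 0.592, 0.058],
    "High fevers": [0.0, 0.004, 0.13, 0.106, 0.112, 0.107, 0.0, 0.003],
    "Accident or serious trauma": [0.438, 0.022, 2.165, 1.009, 0.712, 0.138, 0.161, 0.105],
    "Season": [0.255, 0.009, 2.846, 1.367, 1.0, 1.0, 0.059, 0.208],
    "Alcohol consumption": [0.566, 0.065, 1.299, 1.68, 0.733, 0.079, 0.151, 0.221],
    "Number of hours spent sitting": [0.218, 0.017, 0.857, 0.707, 0.188, 0.139, 0.821, 0.546],
}


def average_weights(table):
    """Unweighted mean of the per-method weights (assumed aggregation rule)."""
    return {name: sum(values) / len(values) for name, values in table.items()}


def rank_features(avg):
    """Rank 1 goes to the largest average weight."""
    ordered = sorted(avg, key=avg.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


avg = average_weights(weights)
ranks = rank_features(avg)
for name in sorted(ranks, key=ranks.get):
    print(f"{ranks[name]}  {name:<30} {avg[name]:.4f}")
```

Under this assumption the two lowest-ranked attributes are childish diseases and high fevers, which are the two features the paper drops from the final dataset.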

Table 2 Parameter settings for the DT technique

Parameter              Value
Criterion              Information gain
Minimal size of split  2
Minimal leaf size      2
Minimal gain           0.01
Maximal depth          20
Confidence             0.5

approve this discovered knowledge. Decision trees (DTs) are capable of providing a good description of the life habits of the population and therefore help to interpret problems according to mathematical and statistical principles [30]. The structure of a decision tree is composed of several nodes. One node acts as the root node (it has only outgoing edges and no incoming edge), some nodes act as internal or intermediate nodes, and the remaining nodes act as leaf nodes. Each intermediate node has exactly one incoming edge and performs a test function. A leaf node predicts the class label, and a probability is assigned to this class. In the decision tree, it is assumed that the class label is categorical or Boolean. The DT algorithm family includes classical algorithms such as ID3 [31], C4.5 [32] and CART [33]. The parameter settings for the DT technique are given in Table 2.

3.2. Artificial neural network – Multilayer perceptron

A multilayer perceptron (MLP) [34–36] consists of multiple layers of neurons, with a minimum of three: an input layer that receives external inputs, one hidden layer, and an output layer that generates the classification results (see Fig. 2). Apart from the input layer, every neuron in the other layers acts as a computational element with a nonlinear activation function. The principle of the



Fig. 1. Decision tree representation for fertility dataset. (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/THC-140816)

Fig. 2. Structure of the multilayer perceptron with an input layer, a hidden layer and an output layer. The input layer represents the attributes of the dataset, the hidden layer handles attributes that are not linearly separable, and the output layer provides the desired results. A threshold node, which specifies the weight function, is also added to the input layer. (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/THC-140816)

neural network is that when data are available at the input layer, the network neurons run calculations in the consecutive layers until an output value is obtained at each of the output neurons. The output of the neural network specifies the appropriate class for the input data. Each neuron (see Fig. 2) in the input and hidden layers is connected to all neurons in the next layer by weight values. The neurons of the hidden layers (see Fig. 2) compute weighted sums of their inputs and add a threshold. The resulting sums are passed through a sigmoid activation function to obtain the activity of the neurons. This process is defined as follows:

\[ p_j = \sum_{i=1}^{n} w_{j,i}\, x_i + \theta_j, \qquad m_j = f_j(p_j) \]


Table 3 Parameter settings for the MLP technique

Parameter        Value
Type of network  BPN
Hidden layers    3
Training cycles  600
Learning rate    0.3
Momentum         0.25

where $p_j$ is the linear combination of the inputs $x_1, x_2, \ldots, x_n$ and the threshold $\theta_j$, $w_{j,i}$ is the connection weight between the input $x_i$ and the neuron $j$, $f_j$ is the activation function of the $j$th neuron, and $m_j$ is its output. The sigmoid function is a common choice of activation function. It is defined as

\[ f(t) = \frac{1}{1 + e^{-t}} \]
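The weighted sum and sigmoid activation above can be sketched in a few lines of Python. This is only an illustration of the two formulas. The function names are hypothetical, and the weights, inputs and thresholds in the usage lines are made-up values, not parameters of the trained network.

```python
import math


def sigmoid(t):
    """f(t) = 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def neuron_output(weights, inputs, theta):
    """m_j = f(p_j), with p_j = sum_i w_{j,i} * x_i + theta_j."""
    p = sum(w * x for w, x in zip(weights, inputs)) + theta
    return sigmoid(p)


def layer_output(weight_rows, inputs, thetas):
    """Apply neuron_output to every neuron j of one fully connected layer."""
    return [neuron_output(w, inputs, th) for w, th in zip(weight_rows, thetas)]


# Illustrative values only: two inputs feeding a layer of two neurons.
hidden = layer_output([[0.5, -0.5], [1.0, 1.0]], [1.0, 1.0], [0.0, -2.0])
print(hidden)
```

Stacking `layer_output` calls, with the output of one layer used as the input of the next, gives the forward pass of the network described in the text.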

A single neuron in the MLP can linearly separate its input space into two subspaces using a hyperplane defined by the weights and the threshold. The weights define the direction of this hyperplane. To train the MLP, the back propagation learning method [37] is used, which is a gradient descent method for the adaptation of the weights. All weight vectors w are initialized with small random values from a pseudorandom sequence generator. The parameter settings of the MLP used in this paper are given in Table 3.

3.3. Support Vector Machine (SVM)

SVM was developed by Vapnik in 1995 [38] to solve pattern recognition problems. It seeks a classifier that satisfies the following conditions:

\[ w^T \phi(x_i) + b \geq +1 \ \text{ if } y_i = +1, \qquad w^T \phi(x_i) + b \leq -1 \ \text{ if } y_i = -1 \tag{1} \]

In SVM, the data are mapped into a higher dimensional feature space and an optimal separating hyperplane is constructed in that space. More detailed descriptions of SVM can be found in [39,40]. For a two-class classification problem, it is required to estimate a function $f: R^N \to \{\pm 1\}$ from training data, which consist of $M$ patterns $x_i$ of dimension $N$ with class labels $y_i$, such that $f$ will classify new samples $(x, y)$ correctly. For training data $(x_1, y_1), \ldots, (x_M, y_M)$, the conditions in Eq. (1) are equivalent to

\[ y_i \left[ w^T \phi(x_i) + b \right] \geq 1, \qquad i = 1, 2, \ldots, M. \tag{2} \]

Here the training vectors $x_i$ are mapped into a higher dimensional space by the function $\phi$. Equation (2) constructs a hyperplane $w^T \phi(x) + b = 0$ in this higher dimensional space, which discriminates between the two classes. Each of the two half spaces defined by this hyperplane corresponds to one class, H1 for $y_i = +1$ and H2 for $y_i = -1$. Figure 3 shows the mapping between input space and feature space in a two-class problem using SVM. The decision function of the SVM classifier is therefore

\[ y(x) = \operatorname{sign}\left[ w^T \phi(x) + b \right] \]
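As a small illustration of the separation constraint and the sign decision function, the sketch below assumes the simplest case in which the mapping φ is the identity (a linear SVM). With the RBF kernel used in this paper the mapping is implicit, so this is not the classifier actually trained. The weight vector, bias and sample points are made-up values, and the function names are hypothetical.

```python
def decision_value(w, b, x):
    """w^T x + b, i.e. the SVM score with phi taken as the identity (assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Sign decision: +1 on one side of the hyperplane, -1 on the other."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def satisfies_constraints(w, b, samples):
    """Check y_i * (w^T x_i + b) >= 1 for every training pair (x_i, y_i)."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in samples)


# Illustrative, linearly separable toy data (not taken from the fertility dataset).
w, b = [1.0, 0.0], 0.0
samples = [([2.0, 1.0], 1), ([1.0, -1.0], 1), ([-1.0, 0.5], -1), ([-3.0, 2.0], -1)]

print(satisfies_constraints(w, b, samples))
print(predict(w, b, [0.5, 0.0]), predict(w, b, [-0.2, 4.0]))
```

The points with `y * decision_value(...)` exactly equal to 1 are the ones lying on the two bounding lines of the margin.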



Fig. 3. Mapping of input space and feature space in a two class problem using SVM.

Table 4 Parameter settings for the SVM method

Parameter         Value
SVM type          C-SVC
Kernel type       RBF
Gamma             0.3
Cost (C)          0.95
Epsilon (E)       0.001
Class weight (W)  0.45
Degree (D)        0.85

Table 5 Parameter settings for the SVM-PSO method

Parameter           Value
Kernel type         Radial
Kernel gamma        1
Max. evaluations    500
Population size     10
Inertia weight      0.7
Local best weight   0.3
Global best weight  0.9

Thus the SVM finds a linear separating hyperplane with the maximal margin in this higher dimensional space. The margin of a linear classifier is the minimal distance of any training point to the hyperplane, i.e. the distance between each of the dotted lines H1 and H2 and the solid line shown in Fig. 3. The points x that lie on the solid line satisfy w^T φ(x) + b = 0, where w is normal to the hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w|| is the Euclidean norm of w. The shortest distance from the separating hyperplane to the closest positive (negative) example is 1/||w||. Therefore, the margin of a separating hyperplane is 1/||w|| + 1/||w|| = 2/||w||. Calculating the optimal separating hyperplane is equivalent to maximizing this separation margin, i.e. the distance between the two dotted lines H1 and H2. Table 4 provides the parameter settings of the SVM method.

3.4. Support vector machine and particle swarm optimization (SVM-PSO) based method

In the SVM-PSO classifier, PSO is applied to solve the dual optimization problem in the implementation of SVM [43,44]. Each particle X = {x1, x2, ..., xn} in SVM-PSO is described as follows.
– x1: non-negative integer value, which represents the algorithm used for classification;
– x2: real value, the cost parameter C;
– x3: real value, which represents the bias or inertia weight w.
Each particle describes a potential solution, in the search space, to the problem being solved. The personal best (pbest) of a given particle is the position at which that particle obtained the maximum value given by the classification method used. The local best (lbest) is the position of the best particle in the neighborhood of a given particle. The global best (gbest) is the position of the best particle of the entire population. The velocity is the vector that determines the direction in which a particle needs


Table 6 Parameter settings for the Naive Bayes (Kernel) method

Parameter          Value
Estimation method  Greedy
Minimum bandwidth  0.01
Number of kernels  10

to move in order to improve its current position. The inertia weight, denoted by W, controls the impact of the previous velocities on the current velocity of a given particle. The results can depend on the number of particles used in the optimization: the larger the number of particles, the better the coverage of the search space, but at the cost of a larger demand for computational resources. Another factor that can affect the results is the initialization of the particles, which are randomly initialized in the search space; thus the sequential ordering and the random number generator used in the implementation of this algorithm can also have an impact on the results. The parameter settings for SVM-PSO used in this paper are given in Table 5.

3.5. Kernel based Naive Bayes (NB) classifier method

The Naive Bayes (NB) classifier is one of the simplest classifiers based on Bayes' theorem [45,46]. It assumes that the features are conditionally independent given the class. It classifies the data in two steps:
Training step: NB estimates the parameters of a probability distribution under the assumption that the features are conditionally independent given the class.
Prediction step: NB calculates the posterior probability that unknown test data belong to each class and classifies the data according to this posterior probability.
In spite of the strong independence assumption, the performance of NB is quite good even on datasets for which the assumption does not hold. To avoid the strong parametric assumption, the kernel based Naive Bayes method was developed [47]. The NB (Kernel) method uses non-parametric kernel density estimation to model the conditional density of a continuous variable given a value of its parents, f(x | C = c).
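To make the class-conditional density f(x | C = c) concrete, the sketch below estimates it with a Gaussian kernel and a fixed bandwidth h for a single continuous feature, then combines it with the class priors through Bayes' rule. The fixed bandwidth, the Gaussian kernel and the toy samples are assumptions made for illustration. The paper's own settings (Table 6: greedy estimation, minimum bandwidth, number of kernels) are not reproduced here, and the function names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, samples, h):
    """Kernel density estimate of f(x | C = c) from the samples of one class."""
    return sum(gaussian_kernel((x - s) / h) for s in samples) / (len(samples) * h)


def posterior(x, samples_by_class, h):
    """P(C = c | x) proportional to P(C = c) * f(x | C = c), for one feature."""
    n = sum(len(s) for s in samples_by_class.values())
    joint = {c: (len(s) / n) * kde(x, s, h) for c, s in samples_by_class.items()}
    total = sum(joint.values())  # assumes x is close enough to the data that total > 0
    return {c: value / total for c, value in joint.items()}


# Illustrative one-dimensional samples per class (not taken from the fertility dataset).
toy = {"normal": [0.10, 0.20, 0.25, 0.30], "altered": [0.80, 0.90]}
print(posterior(0.20, toy, h=0.1))
print(posterior(0.85, toy, h=0.1))
```

With several features, a naive Bayes classifier would multiply the per-feature densities for each class before normalising.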
This method can approximate more complex distributions than the Gaussian parametric approach. The literature also reports that kernel density estimation is a more flexible estimator than the multinomial distribution and also more flexible than histograms [48]. Table 6 shows the parameter settings for the NB (Kernel) method.

4. Experimental results

In this paper, the MLP, DT, NB (Kernel), SVM-PSO and SVM methods are evaluated on the fertility dataset and applied to the prediction of the male fertility rate. The dataset was formed from 100 volunteers between 18 and 36 years of age and shows the relationship of life habits and environmental factors with semen parameters. The SVM-PSO method, which is the hybridization of the SVM and PSO techniques, has the highest accuracy among all the techniques. This paper also sheds light on the role of feature selection methods in prediction and classification. Thus, feature selection methods are first applied to the fertility dataset to find out which features are more important for prediction. Eight feature selection techniques are used to find the weight of each feature of the fertility



Table 7 Statistics of the fertility dataset

Name of attribute              Role        Type       Statistics                     Range              Missing values
Diagnosis                      Prediction  Binominal  mode = N (88), least = A (12)  N (88), A (12)     0
Season                         Regular     Nominal    avg. = −0.072 ± 0.797          [−1.000 ; 1.000]   0
Age                            Regular     Real       avg. = 0.669 ± 0.121           [0.500 ; 1.000]    0
Childish diseases              Regular     Nominal    avg. = 0.870 ± 0.338           [0.000 ; 1.000]    0
Accident or serious trauma     Regular     Nominal    avg. = 0.440 ± 0.499           [0.000 ; 1.000]    0
Surgical intervention          Regular     Nominal    avg. = 0.510 ± 0.502           [0.000 ; 1.000]    0
High fevers                    Regular     Nominal    avg. = 0.190 ± 0.581           [−1.000 ; 1.000]   0
Alcohol consumption            Regular     Nominal    avg. = 0.832 ± 0.168           [0.200 ; 1.000]    0
Smoking habit                  Regular     Nominal    avg. = −0.350 ± 0.809          [−1.000 ; 1.000]   0
Number of hours spent sitting  Regular     Nominal    avg. = 0.407 ± 0.186           [0.060 ; 1.000]    0

dataset. These techniques are PCA, correlation, NN, SVM, SVM-PSO, T-test, chi-square and evolutionary LR. Table 1 shows the weight of each feature under the eight methods. From Table 1, it can be concluded that childish diseases and high fever are less important parameters, while season, age and alcohol consumption are important parameters for the prediction of the fertility rate in men. Thus, our experimental dataset contains only seven features instead of nine. The MLP, DT, NB (Kernel), SVM-PSO and SVM techniques are then used to predict whether the fertility rate is normal or altered. These techniques are also applied to the fertility dataset without feature selection, i.e. the fertility dataset with all nine features. Accuracy, sensitivity, specificity, area under curve, positive predicted value and negative predicted value are used to evaluate the performance of these techniques using the k-fold cross-validation method. A confusion matrix is obtained to calculate the values of these parameters. Table 8 shows the confusion matrices for the MLP, DT, NB (Kernel), SVM-PSO and SVM techniques on the fertility dataset.

4.1. Dataset characteristics

The fertility dataset contains nine features: season, age, accident/trauma, childish disease, high fever, surgical intervention, alcohol consumption, smoking habit and number of hours spent sitting per day. Table 7 shows the statistics of the fertility dataset. The original dataset contains 100 instances with nine attributes and two classes. The classes are normal and altered fertility rate. To find the more relevant features, eight feature selection methods are applied to the fertility dataset (see Table 1). From Table 1, it is concluded that the childish diseases and high fever features have less impact on the prediction of the fertility rate, both experimentally and medically. For this reason, the final experimental dataset contains seven attributes rather than nine.

4.2. Performance parameters

Accuracy: The accuracy of a model is defined as the number of correctly classified instances divided by the total number of instances, i.e. it gives the proportion of correctly classified instances:

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives in the confusion matrix.


Table 8 Confusion matrices generated for the fertility dataset using the MLP, SVM, SVM + PSO, NB (Kernel) and DT techniques

Technique    Predicted      True normal  True altered
MLP          pred. normal   87           7
             pred. altered  1            5
SVM          pred. normal   87           8
             pred. altered  1            4
SVM + PSO    pred. normal   86           4
             pred. altered  2            8
NB (Kernel)  pred. normal   86           9
             pred. altered  2            3
DT           pred. normal   84           7
             pred. altered  4            5

Table 9 Performance comparison of the MLP, SVM, SVM + PSO, NB (Kernel) and DT techniques with and without feature selection methods; feature selection enhances the performance of the techniques

                               Without feature selection methods         Using feature selection methods
Parameter                      MLP    SVM    SVM+PSO  NB (Kernel)  DT     MLP    SVM    SVM+PSO  NB (Kernel)  DT
Classification accuracy (%)    86     86     85       83           84     92     91     94       89           89
Specificity (%)                94.1   97.7   92.04    90.9         96.5   98.86  98.96  97.33    97.72        95.44
Sensitivity (%)                40     20     33.33    25           13.3   41.67  33     66.67    25           41.6
Positive predictive value (%)  89.9   87.4   91.09    89.88        86.3   92.55  91.58  95.56    90.52        92.3
Negative predictive value (%)  54.5   60     57.14    41.3         40     83.33  80     80       60           55.5

Fig. 4. Schematic diagram of the 10-fold cross-validation technique, in which the portion of the wheel marked with a bold red letter acts as the test fold while the others act as training folds. This process is executed for 10 iterations, and each iteration uses a different test fold. (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/THC-140816)
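The rotation depicted in Fig. 4 can be sketched as an index generator in which each of the k folds serves as the test fold exactly once. This is a minimal sketch with contiguous folds. Whether the original experiments shuffled or stratified the 100 instances is not stated, so neither is done here, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one pair per fold.

    Folds are contiguous blocks of indices; no shuffling or stratification
    is applied (an assumption, since the text does not specify either).
    """
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 100 instances and 10 folds, as in the fertility dataset experiments.
folds = list(k_fold_indices(100, 10))
for iteration, (train, test) in enumerate(folds, start=1):
    print(iteration, len(train), len(test), test[0], test[-1])
```

In an evaluation loop, a model would be trained on `train` and scored on `test` in each iteration, and the per-fold results would then be pooled or averaged.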

Sensitivity: This parameter measures the ability of the model to correctly classify persons with the disease and is defined as

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]

Specificity: This parameter measures the ability of the model to correctly classify persons without the disease and is defined as

\[ \text{Specificity} = \frac{TN}{TN + FP} \]

Positive predicted value: The positive predicted value depends on the prevalence of a disease. It is the probability that a person classified as diseased actually has the disease.

\[ \text{Positive predicted value} = \frac{TP}{TP + FP} \]

Negative predicted value: The negative predicted value also depends on the prevalence of a disease. It is the probability that a person classified as not diseased does not have the disease.

\[ \text{Negative predicted value} = \frac{TN}{TN + FN} \]
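As a worked example, the five formulas can be applied to the MLP confusion matrix of Table 8. The sketch assumes that "altered" is the positive (disease) class, which the text does not state explicitly. The function name `metrics` is hypothetical.

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# MLP row of Table 8, with 'altered' taken as the positive class (assumption):
# TP = altered predicted altered, FN = altered predicted normal,
# TN = normal predicted normal,   FP = normal predicted altered.
mlp = metrics(tp=5, fp=1, tn=87, fn=7)
for name, value in mlp.items():
    print(f"{name:<12} {100 * value:.2f}%")
```

Under this assumption the accuracy (92%), sensitivity (41.67%) and specificity (98.86%) agree with the MLP column of Table 9. The two predictive values come out as 83.33% and 92.55%, the reverse of the order listed in Table 9, which suggests that the table takes "normal" as the positive class for those two rows.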



Fig. 5. AUC and ROC threshold of MLP (AUC 0.728). (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/THC-140816)

Fig. 6. AUC and ROC threshold of SVM (AUC 0.758). (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/THC-140816)

Area under curve: The area under curve (AUC) is an important parameter used to assess the performance of a diagnostic test [42]. It is the area under the receiver operating characteristic (ROC) curve, a two-dimensional plot of sensitivity against 1 − specificity, and it measures the validity of medical tests.

4.3. K-fold cross-validation technique

The k-fold cross-validation technique is widely used to validate the performance of data mining models as well as in the statistical analysis of datasets [41]. In k-fold cross-validation, k is defined as the number of folds in the

542


Fig. 7. AUC and ROC threshold of PSO+SVM (AUC 0.932). (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/THC-140816)

Fig. 8. AUC and ROC threshold of NB (Kernel) (AUC 0.850). (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/THC-140816)

dataset, of which k − 1 folds are used as training instances and the remaining fold is used as the test instance. In this paper, the number of folds is 10, i.e. the dataset is divided into 10 parts such that D = {a1, a2, a3, ..., a10}. In each iteration, 9 of the 10 folds are used as training instances and the remaining fold is used as the test instance, which gives the accuracy of the model; the test instance is used to predict the class labels. A schematic diagram of the 10-fold cross-validation technique is given in Fig. 4. Table 8 shows the confusion matrices obtained for the fertility dataset using the MLP, DT, NB (Kernel), SVM +



Fig. 9. AUC and ROC threshold of DT (AUC 0.735). (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/THC-140816)

PSO and SVM techniques. Table 9 is prepared using Table 8. Table 9 shows that the MLP, DT, NB Kernel, SVM-PSO and SVM techniques provide more accurate results with feature selection methods than without them. It can therefore be stated that feature selection methods play a vital role in prediction as well as classification: they improve not only the accuracy of the techniques but also the other parameters, for example the positive and negative predicted values, which give a clear picture of whether a person is affected by the disease, and likewise the sensitivity and specificity. Figures 5–9 show the analysis of the ROC and AUC parameter values. The area under the ROC curve measures the diagnostic accuracy of a test and is used to make comparisons between diagnostic tests or observers. From these figures, it can be concluded that the SVM-PSO technique obtains the highest value (0.932) among all these methods. This indicates that the SVM-PSO technique is better able to distinguish normal from affected persons.

5. Conclusion

In this paper, the MLP, DT, NB Kernel, SVM-PSO and SVM methods are used to predict the male fertility rate as well as to identify the environmental and lifestyle parameters which may affect semen quality. A comparison between the MLP, DT, NB Kernel, SVM-PSO and SVM methods is given in Table 4. The results are obtained in two ways: with feature selection methods and without feature selection methods. From this comparison, it can be concluded that feature selection methods play a significant role in the prediction of the male fertility rate in two ways. First, the feature selection methods increase the accuracy of each technique, i.e. 
(multilayer perceptron 92%, support vector machine 91%, SVM-PSO 94%, NB (kernel) 89% and decision tree 89% instead of multilayer perceptron 86%, support vector machine 86%, SVM-PSO 85%, NB (kernel) 83% and decision tree 84%). Second, the feature selection methods are also used to identify the environmental and lifestyle parameters which



have more impact on the prediction task. It is also found that the SVM-PSO method provides better results than the other techniques.
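As an illustration of the evaluation protocol of Section 4.3 and of the AUC parameter, the following minimal Python sketch splits 100 instances into 10 folds and computes an AUC from classifier scores. It assumes the standard form of k-fold cross-validation, in which each fold serves once as the test set, and it computes the AUC as the fraction of (positive, negative) pairs that are ranked correctly, which equals the area under the empirical ROC curve. The function names, the seed and the example scores are hypothetical; the paper does not state how its tooling performs these steps.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n-1 and deal them into k near-equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(n: int, k: int, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of positive/negative pairs ranked correctly (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative instance")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# 100 instances and 10 folds, matching the sizes reported in the paper.
splits = list(train_test_splits(n=100, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))

# Hypothetical labels and scores, for illustration only.
example_auc = auc(labels=[1, 1, 0, 0], scores=[0.9, 0.4, 0.5, 0.1])
print(example_auc)
```

In practice a classifier would be trained on each `train` list and scored on the matching `test` list, and the per-fold results would then be averaged or pooled.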

References

[1] Inhorn MC. Global infertility and the globalization of new reproductive technologies: Illustrations from Egypt. Soc Sci Med 2003; 56: 1837-1851.
[2] Lutz W, O’Neill BC, Scherbov S. Demographics. Europe’s population at a turning point. Science 2003; 299: 1991-1992.
[3] Grant J, Hoorens S, Sivadasan S, Loo MV, Davanzo J, Hale L, Butz W. Trends in European fertility: should Europe try to increase its fertility rate . . . or just manage the consequences? Int J Androl 2006; 29: 17-24.
[4] Skouby SO. Contraceptive use and behavior in the 21st century: A comprehensive study across five European countries. Eur J Contracept Reprod Health Care 2004; 9: 57-68.
[5] Cibula D. Women’s contraceptive practices and sexual behaviour in Europe. Eur J Contracept Reprod Health Care 2008; 13: 362-375.
[6] Skakkebaek NE, Jorgensen N, Main KM, Rajpert-De Meyts E, Leffers H, Andersson AM, Juul A, Carlsen E, Mortensen GK, Jensen TK, Toppari J. Is human fecundity declining? Int J Androl 2006; 29: 2-11.
[7] Carlsen E, Giwercman A, Keiding N, Skakkebaek NE. Evidence for decreasing quality of semen during past 50 years. BMJ 1992; 305: 609-613.
[8] Auger J, Kunstmann JM, Czyglik F, Jouannet P. Decline in semen quality among fertile men in Paris during the past 20 years. N Engl J Med 1995; 332: 281-285.
[9] Splingart C, Frapsauce C, Veau S, Barthelemy C, Royere D, Guerif F. Semen variation in a population of fertile donors: Evaluation in a French centre over a 34-year period. Int J Androl 2011; 35: 467-474.
[10] Berling S, Wolner-Hanssen P. No evidence of deteriorating semen quality among men in infertile relationships during the last decade: A study of males from Southern Sweden. Hum Reprod 1997; 12: 1002-1005.
[11] Andolz P, Bielsa MA, Vila J. Evolution of semen quality in North-eastern Spain: A study in 22,759 infertile men over a 36 year period. Hum Reprod 1999; 14: 731-735.
[12] Irvine DS. Male reproductive health: cause for concern. Andrologia 2000; 32: 195-208.
[13] Skakkebaek NE, Rajpert-De Meyts E, Jorgensen N, Main KM, Leffers H, Andersson AM, Juul A, Jensen TK, Toppari J. Testicular cancer trends as ‘whistle blowers’ of testicular developmental problems in populations. Int J Androl 2007; 30: 198-204.
[14] Wong WY, Zielhuis GA, Thomas CM, Merkus HM, Steegers-Theunissen RP. New evidence of the influence of exogenous and endogenous factors on sperm count in man. Eur J Obstet Gynecol Reprod Biol 2003; 110: 49-54.
[15] Giwercman A, Giwercman YL. Environmental factors and testicular function. Best Pract Res Clin Endocrinol Metab 2011; 25: 391-402.
[16] Martini AC, Molina RI, Estofan D, Senestrari D, Fiol de Cuneo M, Ruiz RD. Effects of alcohol and cigarette consumption on human seminal quality. Fertil Steril 2004; 82: 374-377.
[17] Agarwal A, Desai NR, Ruffoli R, Carpi A. Lifestyle and testicular dysfunction: a brief update. Biomed Pharmacother 2008; 62: 550-553.
[18] Kolettis PN. Evaluation of the subfertile man. Am Fam Physician 2003; 67: 2165-2172.
[19] Bonde JP, Ernst E, Jensen TK, Hjollund NH, Kolstad H, Henriksen TB, Scheike T, Giwercman A, Olsen J, Skakkebaek NE. Relation between semen quality and fertility: a population-based study of 430 first-pregnancy planners. Lancet 1998; 352: 1172-1177.
[20] Barratt CL, Clements S, Kessopoulou E. Semen characteristics and fertility tests required for storage of spermatozoa. Hum Reprod 1998; 13(2): 1-7.
[21] Su CT, Yang CH. Feature selection for the SVM: An application to hypertension diagnosis. Expert Systems with Applications 2008; 34(1): 754-763.
[22] Yadav G, Kumar Y, Sahoo G. Predication of Parkinson’s disease using data mining methods: A comparative analysis of tree, statistical, and support vector machine classifiers. Indian Journal of Medical Science 2011; 65(6): 231-42.
[23] Karabatak M, Ince MC. An expert system for detection of breast cancer based on association rules and neural network. Expert Systems with Applications 2009; 36(2): 3465-3469.
[24] Er O, Temurtas F, Tanrıkulu AC. Tuberculosis disease diagnosis using artificial neural networks. Journal of Medical Systems 2010; 34(3): 299-302.
[25] Kumar Y, Sahoo G. Prediction of different types of liver diseases using rule based classification model. Technology and Health Care 2013; 21(6): 417-432.
[26] David G, Girela JL, Juan JD, Jose Gomez-Torres M, Johnsson M. Predicting seminal quality with artificial intelligence methods. Expert Systems with Applications 2012; 39(16): 12564-12573.

[27] Liu H, Hiroshi M. Feature selection for knowledge discovery and data mining. Springer, 1998.
[28] Isabelle G, André E. An introduction to variable and feature selection. The Journal of Machine Learning Research 2003; 3: 1157-1182.
[29] Ron K, John GH. Wrappers for feature subset selection. Artificial Intelligence 1997; 97(1): 273-324.
[30] Brida JG, Adrián Risso W. Hierarchical structure of the German stock market. Expert Systems with Applications 2010; 37(5): 3846-3852.
[31] Quinlan JR. Unknown attributes values in induction. In ML, 1989, 164-168.
[32] Quinlan JR. C4.5: Programs for machine learning. Vol. 1. Morgan Kaufmann, 1993.
[33] Breiman L. Classification and regression trees. Wadsworth International Group, 1984. ISBN 9780534980535.
[34] Ripley BD. Pattern recognition and neural networks. Cambridge University Press, 1996.
[35] Haykin S. Neural networks: A comprehensive foundation. Englewood Cliffs, NJ: Prentice-Hall, 1998.
[36] Bishop CM. Neural networks for pattern recognition. Oxford Univ Pr, 2005.
[37] Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature 1986; 323: 533-536.
[38] Vapnik V. The nature of statistical learning theory. Springer-Verlag, New York, 1995.
[39] Hsu CW, Chang CC, Lin CJ. A practical guide to support vector classification. National Taiwan University, Taiwan, 2003.
[40] Theodoridis S, Koutroumbas K. Pattern recognition. Elsevier/Academic Press, 2008. ISBN 9781597492720.
[41] Geisser S. Predictive inference: An introduction (Vol. 55). CRC Press, 1993.
[42] Marzban C. The ROC curve and the area under it as performance measures. Weather and Forecasting 2004; 19(6): 1106-1114.
[43] Prasad Y, Biswas KK. PSO-SVM based classifiers: A comparative approach. In Contemporary Computing, Springer Berlin Heidelberg, 2010, 241-252.
[44] Mandal I. SVM-PSO based feature selection for improving medical diagnosis reliability using machine learning ensembles. 2012; 2(3): 267-276.
[45] Langley P, Iba W, Thompson K. An analysis of Bayesian classifiers. In: Proceedings of the 10th National Conference on Artificial Intelligence, 1992, pp. 223-228.
[46] Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Machine Learning 1997; 29(2-3): 131-163.
[47] John GH, Langley P. Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995, pp. 338-345.
[48] Silverman BW. Density estimation for statistics and data analysis. Vol. 26. CRC Press, 1986.

