Biomedical visual data analysis to build an intelligent diagnostic decision support system in medical genetics

Kaya Kuru a,∗, Mahesan Niranjan b, Yusuf Tunca c, Erhan Osvank d, Tayyaba Azim b

a Department of Communication, Electronics, and Information Systems, Gülhane Military Medical Academy, Etlik, Ankara 06010, Turkey
b School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BF, UK
c Department of Medical Genetics, Gülhane Military Medical Academy, Etlik, Ankara 06010, Turkey
d Institute of Informatics, Middle East Technical University, Balgat, Ankara 06531, Turkey

∗ Corresponding author. Tel.: +90 5439705637. E-mail address: [email protected] (K. Kuru).

Article history: Received 12 May 2013; received in revised form 15 August 2014; accepted 16 August 2014.

Keywords: Decision support system; Machine learning; Visual data analysis; Principal components analysis; Medical genetics; Dysmorphology; Facial genotype–phenotype

Abstract

Background: In general, medical geneticists aim to pre-diagnose underlying syndromes based on facial features before performing cytological or molecular analyses where a genotype–phenotype interrelation is possible. However, determining correct genotype–phenotype interrelationships among many syndromes is tedious and labor-intensive, especially for extremely rare syndromes. Thus, a computer-aided system for pre-diagnosis can facilitate effective and efficient decision support, particularly when few similar cases are available, or in remote rural districts where diagnostic knowledge of syndromes is not readily available.

Methods: The proposed methodology, visual diagnostic decision support system (visual diagnostic DSS), employs machine learning (ML) algorithms and digital image processing techniques in a hybrid approach for automated diagnosis in medical genetics. This approach uses facial features in reference images of disorders to identify visual genotype–phenotype interrelationships. Our statistical method describes facial image data as principal component features and diagnoses syndromes using these features.

Results: The proposed system was trained using a real dataset of previously published face images of subjects with syndromes, which provided accurate diagnostic information. The method was tested using a leave-one-out cross-validation scheme with 15 different syndromes, each comprising 5–9 cases, i.e., 92 cases in total. An accuracy rate of 83% was achieved using this automated diagnosis technique, which was statistically significant (p < 0.01). Furthermore, the sensitivity and specificity values were 0.857 and 0.870, respectively.

Conclusion: Our results show that the accurate classification of syndromes is feasible using ML techniques. Thus, a large number of syndromes with characteristic facial anomaly patterns could be diagnosed with diagnostic DSSs similar to that described in the present study, i.e., the visual diagnostic DSS, thereby demonstrating the benefits of using hybrid image processing and ML-based computer-aided diagnostics for identifying facial phenotypes.

© 2014 Published by Elsevier B.V.

1. Introduction

Dysmorphology is an area of clinical genetics that is concerned with abnormal patterns of human development and syndrome diagnosis in patients who possess congenital malformations and unusual facial features, often with delayed motor and cognitive development [1]. A high degree of experience and expertise is required to diagnose a dysmorphic patient correctly [2] because most of these syndromes are very rare.


However, in some parts of the world, the diagnosis of syndromes is generally performed by medical professionals who are not well trained in dysmorphology, such as general practitioners, pediatricians, or dermatologists, rather than by medical geneticists, because the latter are scarce. In general, the diagnosis of dysmorphology is based on databases that contain limited numbers of images and use standard terminology. Medical professionals might not be highly familiar with this terminology, especially in areas where expert knowledge is not readily accessible, which may make it difficult to obtain correct diagnoses. These challenges may lead to diagnostic inaccuracy, thereby compromising the treatment of patients according to their specific needs, as well as the provision of adequate guidance to the parents of patients.


Delays in diagnosis may also hinder access to critical services, such as clinical trials, and a patient's referral to supportive services, including early intervention, physical therapy, and occupational therapy. Correct diagnoses and appropriate treatments, particularly during the early stages, can influence the course of dysmorphic diseases. For example, bone marrow transplantation or enzyme replacement therapy can now be offered for some innate metabolic disorders (e.g., Fabry disease) [1], in addition to many other specific treatments for other syndromes.

The face is acknowledged to be the attribute that best distinguishes a person from others, even at first glance. Facial features provide many clues about the identity, age, gender, and even ethnicity of a person. The face may be influenced by many genes, particularly genes related to syndromes, and in many cases it provides significant information related to dysmorphology. Thus, facial appearance is a significant cue during the early diagnosis of syndromes, which are generally associated with cognitive impairments. Therefore, many decision support systems (DSSs) for dysmorphic diagnosis have been developed based on anthropometry, particularly craniofacial anthropometry, as well as stereophotogrammetry. Anthropometry is used to measure the weight, size, and proportions of the human body [3], while craniofacial anthropometry measures the distances between landmarks on the surface anatomy of the head [4]. Stereophotogrammetry employs multiple views in two-dimensional (2D) images to generate three-dimensional (3D) images [5].

Previous studies have shown that many syndromes can be diagnosed correctly using computer-aided face analysis DSSs [2,3,6–8]. In particular, Farkas [3] was the first to study facial morphology based on anthropometry using several methods. These techniques include the use of rulers, protractors, calipers, and tape measures, and they have been applied widely in the analysis of facial dysmorphology [8]. Similar craniofacial analyses that compare a patient's phenotype to the standardized norms in a control population are employed by many clinicians [8].

A possible approach for diagnosing dysmorphic patients is to define rule sets and to apply them manually based on standardized norms. This approach may be feasible in some cases, but it has many drawbacks and is prone to errors. In practice, it is very difficult for health care professionals to keep track of all the relevant up-to-date knowledge regarding syndromes and to deal effectively with large volumes of information in many dimensions [9]. Indeed, humans might not be able to develop a systematic response to any problem that involves more than seven variables [10]. Moreover, constructing and employing rule sets is a labor-intensive process.

Assigning faces to classes based on appearance is unlikely to be accepted by medical professionals unless the mathematical features determined and identified by feature selection algorithms for discrimination can be related to facial patterns [11]. Several methods have been applied previously to the detection and analysis of facial patterns, such as principal components analysis (PCA), kernel PCA, independent components analysis, probability density estimation, local feature analysis, elastic graph matching (EGM), multi-linear analysis, kernel discriminant analysis, Gabor wavelet (GW), Fisher's linear discriminant analysis (LDA), and support vector machines.
In particular, EGM, Fisher's LDA, GW, and PCA using eigenfaces have been employed widely to extract features from the face region. The high accuracy of these methods for extracting features and subsequently discerning facial patterns has been demonstrated in many studies. Among these popular techniques, it is not easy to choose the best one for implementing a diagnostic DSS for dysmorphology. Thus, we suggest the use of ensembles of some of these methods to reduce error rates in future research.

However, excellent results can be achieved using PCA-based feature extraction and the subsequent discrimination of individuals, with accuracy rates of up to 96% [12]. Using PCA, good success rates can be obtained by detecting facial patterns in images captured in ideal environments, particularly with good illumination, or by employing several image processing techniques to enhance the images before feature extraction. PCA is an optimal transformation scheme in the sense that it minimizes the mean squared error between an image and its reconstruction [13]. PCA using eigenfaces is computationally efficient compared with other similar methods [14] because the reduction from 2D images to one-dimensional vectors can be performed easily to accelerate the calculations. Thus, we employ the PCA-based eigenface method to extract features from faces in our visual diagnostic DSS. Furthermore, this machine learning (ML) method was selected because of its extensive and successful application to many datasets.

In addition, Bayesian decision theory, multiple similarity, the city block distance, subspace methods, the Mahalanobis distance, and the Euclidean distance are well-known methods for measuring the distance between two points in a feature dataset [13]. The Mahalanobis distance and the Euclidean distance are the most widely used of these methods [13]. However, Kapoor [13] showed that the Mahalanobis distance is more effective than the Euclidean distance. It differs from the Euclidean distance because it considers the correlations in the dataset and it is scale-invariant, i.e., it does not depend on the scale of the measurements [13,15].1 In the present study, we tested these two methods using our feature dataset to determine the best one for our method, and we found that the Euclidean distance outperformed the Mahalanobis method for measuring distances. Thus, this matching technique was selected for our study.

Hammond [8] claimed that the analysis of 2D or 3D facial morphology images using computer-aided DSSs based on genotype–phenotype correlations could potentially benefit syndrome diagnosis, and our study supports this claim. The method established in the present study is called the visual diagnostic DSS. This method aims to provide the required on-site expertise, but it also attempts to eliminate the time-consuming search of catalogs by practitioners and geneticists to diagnose syndromes, because there are approximately 4700 known syndromes.2 In the proposed methodology, ML algorithms and digital image processing techniques are employed in a hybrid approach to detect meaningful facial features in reference images of disorders that indicate visual genotype–phenotype interrelationships. The proposed system was trained using a real dataset constructed from previously published images of dysmorphic faces, which included accurate diagnostic information about the syndromes considered in the present study. After training, during the diagnosis phase, the system compares the patient's facial features to all the trained features in the database to obtain a ranked list of possible matches based on confidence values above the threshold value specified by the user. The ranked list with similarity values explains how similar a case is to those classified in the database relative to a particular threshold value. The application can be implemented easily at any site. New syndromes can be trained and the dataset can be extended by the end user to improve the implementation. Our statistical method represents facial image data in terms of principal component (PC) features and diagnoses syndromes using these features.
We evaluated the accuracy of the method using a leave-one-out cross-validation scheme.
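To make the distance comparison concrete, the sketch below contrasts the two measures applied to PC weight vectors. It is a minimal illustration in Python/NumPy (the system itself was implemented in C++, and the array sizes and random vectors here are purely illustrative):

import numpy as np

def euclidean(u, v):
    # Straight-line distance between two PC weight vectors.
    return np.sqrt(np.sum((u - v) ** 2))

def mahalanobis(u, v, cov):
    # Scale-invariant distance that accounts for correlations in the
    # feature dataset via the inverse covariance matrix.
    d = u - v
    return np.sqrt(d @ np.linalg.inv(cov) @ d)

# Illustrative data: 92 cases x 10 PC weights, with the covariance
# estimated from the trained weight vectors.
rng = np.random.default_rng(0)
W = rng.normal(size=(92, 10))
cov = np.cov(W, rowvar=False)
print(euclidean(W[0], W[1]), mahalanobis(W[0], W[1], cov))

Both measures rank candidate matches; they differ only in how the feature space is scaled.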

1 More information about the Mahalanobis distance can be found in Gul's thesis [15].
2 Many new dysmorphic diseases are described each year in the London Dysmorphology Database (http://www.lmdatabases.com [accessed 25.01.14]).


Fig. 1. Conceptual framework of the method. The four main modules are: 1, image acquisition and face detection; 2, image enhancement and feature extraction; 3, training; and 4, diagnosis. These modules comprise several sub-modules, which are delineated in dedicated sections of the main modules.

2. Methodology

2.1. Background

We evaluated previous studies of visual DSSs in terms of their advantages and disadvantages as a first step toward producing an effective diagnostic system for dysmorphology. Most previous genotype–phenotype association studies have focused on a limited number of specific diseases or traits to test whether a computer can classify syndromes and then diagnose new cases based on comparisons with the training cases. Previous studies have reported successful classification [2,6,7,16] using several methods (e.g., using dense surface models for the construction of 3D images3) based on images of children with a limited number of syndromes, typically 5–10.

3 The drawbacks of using 3D construction for syndrome delineation are mentioned in the Discussion section.

Most of these studies involved labor-intensive data preprocessing steps, such as the manual cropping of faces from whole images, and various image enhancement methods, including commercial tools, were used to obtain better images for further processing. In general, these studies addressed a single aspect of the needs of medical professionals, such as recognizing a syndrome from a photograph based on comparisons with several trained syndromes, thereby developing a specific computer-based system rather than providing a complete solution to meet the needs of the overall field of dysmorphology.

The field of dysmorphology needs a composite solution that allows the automatic diagnosis of dysmorphic diseases from raw data (live/video inputs or photographs) without human intervention and that also satisfies the everyday needs of medical professionals when considering their cases, by recognizing a wide range of dysmorphisms, especially when appropriate genetic tests are not available. Moreover, a solution is needed to inform subsequent investigations, including more appropriate genetic testing, thereby avoiding or delaying the need to undertake more expensive genetic tests. Our evaluation of related studies forms the basis of the software requirements analysis, which comprises behavioral, architectural, and functional requirements. Our proposed method is based on the idea that incorporating several well-known methods into a new hybrid approach may facilitate better diagnosis by clinicians in several ways, as suggested in previous studies [17,18] that used hybrid approaches.


Fig. 3. Conceptual understanding of image cropping to obtain a frontal face image that is ready for analysis.

Fig. 2. Interface of the method. Four messages are displayed: the name of the likely diagnosis (e.g., Mowat-Wilson syndrome), the degree of proximity to the detected diagnosis (e.g., 0.86), the threshold value determined by the user (e.g. 0.75), and a message that displays the current state as either recognized successfully or unknown disease. The messages are updated when other syndromes are diagnosed as exceeding the threshold value.

2.2. System design and architecture

The visual diagnostic DSS was developed using the C++ programming language and comprises four main modules: 1, image acquisition and face detection; 2, image enhancement and feature extraction; 3, training; and 4, diagnosis. These main modules are divided into several sub-modules, as shown in Fig. 1. The interface of the implemented method is depicted in Fig. 2. We briefly explain the main modules in the following subsections.

2.2.1. Image acquisition and face detection module

This module captures the images of patients for further analysis. Frontal face images related to different syndromes can be acquired in various environments, e.g., using a camera attached to a computer in real time, from previously recorded videos of patients with syndromes, or from a self-maintained dataset of images stored in a folder, as shown in the image acquisition and face detection phase in Fig. 1. A high-resolution digital camera is mounted facing the dysmorphic patient, and images are captured automatically, provided that a frontal face image is detected. Image capture, which the system triggers automatically, takes 0.2 s, excluding the time during which an appropriate frontal face image cannot be acquired. Thus, image capture is effectively instantaneous, which is well suited to photographing children with mental retardation who are unable to hold a pose for long periods or who may be uncooperative. Photographs that lack frontal face images are not saved, and the application remains idle while searching for an appropriate frontal face image, especially when the head is turned to the side, up, or down. Similarly, frontal face images of dysmorphic patients can be acquired from video inputs as well as from local databases.

Acquiring proper frontal faces is an essential requirement for the automatic data preparation and model-building phases. Images that lack appropriate frontal faces, as shown in Fig. 3, are not captured and saved. Thus, the preset frontal faces are ready to be processed and require no manual preprocessing or data preparation steps.
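The paper does not specify which face detector is used, so the following sketch should be read as one plausible realization of this capture-only-frontal-faces behavior. It uses OpenCV's stock Haar cascade frontal-face detector (an assumption, not the authors' detector) and Python for brevity, although the system itself was written in C++:

import cv2

# Assumption: OpenCV's Haar cascade stands in for the unnamed
# frontal-face detector; frames without a frontal face are skipped.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)  # camera mounted facing the patient
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        continue  # remain idle until an appropriate frontal face appears
    x, y, w, h = faces[0]
    cv2.imwrite("capture.png", frame[y:y + h, x:x + w])  # save the face region
    break
cap.release()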

The characteristics of the dataset used in this study are described in Section 2.3.

2.2.2. Image enhancement and feature extraction module

Images of patients are acquired for training a specific syndrome or for diagnosis, depending on the function triggered by the user. The images are processed automatically for enhancement and feature extraction before the diagnosis or training phases. After acquisition, the images are converted into grayscale and cropped to include only the face region, i.e., the forehead, two eyes, cheeks, and mouth, as shown in Fig. 2 and delineated in Fig. 3. The size of a cropped image depends on the resolution, the image size, and the area of the image occupied by the face. Comparisons between features in subsequent processes are therefore not possible using images of different sizes, so the cropping stage is followed by normalization of the image using well-known interpolation and extrapolation methods. In the first training stage, employed to establish classifiers for syndromes, all of the images are cropped initially, an average mapping size is calculated by the application from the sizes of all the cropped images, and all of the cropped images are then scaled to this new specified size. Images are extrapolated if their height and width are smaller than the average size and interpolated otherwise. In the subsequent phases, i.e., diagnosis and the individual training of other syndromes, normalization is performed according to the previously specified average size. Thus, each cropped image becomes smaller or larger and is mapped to the same width and height before comparisons are made among similar features. The images are then standardized using two essential image enhancement methods, i.e., histogram equalization and median filtering, to remove illumination variations and to obtain standard brightness and contrast levels. Thus, excessively dark or low-contrast images are enhanced and better features can be captured.

We apply PCA to the standardized face images to extract significant features for classification. The PCA method accelerates our application during the training and diagnosis phases due to its simplicity, learning capacity, and reduced computational complexity. A set of images in a high-dimensional feature space is transformed into a lower-dimensional feature space with a set of feature images, i.e., a few eigenfaces (the PCs). The first eigenface represents the direction in which the data have their maximum variance, which captures the most valuable feature of the images in the training set. Each subsequent eigenface captures the next most valuable feature with respect to its variance.
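A minimal sketch of the standardization steps described above (grayscale conversion, scaling to the average size, histogram equalization, and median filtering), assuming OpenCV; the function name standardize() and the example sizes are illustrative assumptions:

import cv2
import numpy as np

def standardize(face_img, target_size):
    # Grayscale conversion, scaling to the average cropped size
    # (interpolation or extrapolation as required), histogram
    # equalization, and median filtering.
    gray = cv2.cvtColor(face_img, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, target_size, interpolation=cv2.INTER_LINEAR)
    return cv2.medianBlur(cv2.equalizeHist(resized), 3)

# The average mapping size is computed once from all cropped training
# images; the dummy crops below are purely illustrative.
crops = [np.zeros((70, 60, 3), np.uint8), np.zeros((58, 66, 3), np.uint8)]
avg_size = (int(np.mean([c.shape[1] for c in crops])),   # width
            int(np.mean([c.shape[0] for c in crops])))   # height
standardized = [standardize(c, avg_size) for c in crops]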


To implement PCA, the color images are converted into 8-bit grayscale images. Next, the rows of each grayscale image are placed into a column vector (I) because PCA works on vectors instead of images. An average face vector that represents the mean image (Ψ) is obtained using Eq. (1), as follows:

Ψ = (I1 + I2 + · · · + Im)/m,   (1)

where m represents the number of images in the dataset. Next, we normalize the images by subtracting the mean vector from each image vector (Φi = Ii − Ψ). This process generates, for each face, unique features that differ from the mean image. A covariance matrix is then calculated from this normalized feature matrix to prepare a subspace with reduced dimensionality. Eigenvectors can be calculated from the covariance matrix (C) as C = AA^T for the real space and C = A^T A for the subspace, where A = [Φ1, Φ2, Φ3, ..., Φm]. Singular value decomposition is then performed to obtain the most significant eigenvectors as A = UDV^T, where D is the diagonal matrix of the singular values σi of A, U contains the eigenvectors of AA^T for the real space, and V contains the eigenvectors of A^T A in the subspace. We select the k most significant eigenvectors and operate in the subspace to reduce the cost of the calculations and to significantly reduce the noise in the dataset. The top k columns are picked from V (for A^T A) to obtain the top k eigenvectors. After this step, the images are represented by a reduced number k of eigenvectors.

A weight vector [w1, w2, w3, ..., wk] is obtained for each image in terms of the contributions of the k eigenvectors, based on several matrix operations; each weight describes the contribution of the corresponding eigenvector (e.g., w1 represents the contribution of the first eigenvector) to the representation of the image. Each image can be regenerated as the sum of the mean vector and the weighted sum of these k eigenvectors, which is a linear combination of the best k eigenfaces. Thus, these weights represent the face as a combination of k eigenfaces, indicating the proportion of the image contributed by each eigenface on top of the mean image (Ψ). Comparisons are made between the weight vectors of the input image and those of the other images in the dataset to calculate the extent of similarity, as described in detail in Section 2.2.4.

Our statistical method represents facial image data in terms of PCs, and a leave-one-out evaluation scheme is used to quantify the accuracy. In the leave-one-out cross-validation scheme, we leave one image out as a test image and train the system using the remaining images. This process is repeated as many times as there are images (n) in the dataset, so every data point is used as a test sample and its similarity to the others (n − 1) is measured. Images of syndromes are captured for feature extraction during training or diagnosis using the functions shown in Fig. 2, as described in the following subsections. The selection of the number of PCs is explained in Section 2.4 with respect to the size of the dataset.
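In matrix form, the eigenface computation above takes only a few lines. The following Python/NumPy sketch (random data of illustrative shape; fit_eigenfaces() is a hypothetical name, and the authors' implementation is in C++) computes the mean face of Eq. (1), the top k eigenfaces via singular value decomposition, and one k-dimensional weight vector per image:

import numpy as np

def fit_eigenfaces(images, k):
    # images: standardized grayscale faces, all of the same size.
    A = np.stack([im.ravel().astype(np.float64) for im in images], axis=1)
    psi = A.mean(axis=1)                # mean face, Eq. (1)
    Phi = A - psi[:, None]              # Phi_i = I_i - psi
    # SVD of the normalized data: the columns of U are the eigenfaces,
    # ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(Phi, full_matrices=False)
    eigenfaces = U[:, :k]               # keep the k most significant
    weights = (eigenfaces.T @ Phi).T    # one k-dim weight vector per image
    return psi, eigenfaces, weights

# Illustrative call; k = m - 1 as in Section 2.4.
faces = [np.random.rand(64, 64) for _ in range(6)]
psi, E, W = fit_eigenfaces(faces, k=len(faces) - 1)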
2.2.3. Training module

The implementation of PCA requires the manipulation of the eigenvectors/eigenvalues of the covariance matrix rather than the raw image data, as mentioned above. Thus, the eigenvectors/eigenvalues, the average image vector, and the weight vectors generated by PCA are stored in a database with their syndrome names during the training phase. Classifiers are trained with the features extracted by the image enhancement and feature extraction module. A classifier that includes the features of all cases of a specific syndrome is generated for each syndrome. At least two images per syndrome are required to train a classifier. The measurements in the database are recalculated as new syndromes are trained.

Users can easily add new syndromes themselves using the function "train from directory" for a syndrome stored in a directory and "train captured images" for a syndrome whose images are captured from a video/live input. The function "train all database" allows the user to train all of the syndromes at once according to the implementation shown in Fig. 2. In this case, the directory names are the inputs for the syndrome (classifier) names, thereby allowing automatic training without human intervention.
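As a rough illustration of the "train all database" behavior, where directory names act as class labels, the following hypothetical helper (building on fit_eigenfaces() above; the database layout is an assumption) projects every standardized image onto the eigenface space and files its weight vector under its syndrome name:

import os
import cv2
import numpy as np

def train_all_database(root, psi, E):
    # root contains one directory per syndrome; the directory name is
    # accepted as the classifier (syndrome) name.
    db = {}
    for syndrome in sorted(os.listdir(root)):
        folder = os.path.join(root, syndrome)
        if not os.path.isdir(folder):
            continue
        for fname in sorted(os.listdir(folder)):
            # Images are assumed to be standardized to the common size.
            img = cv2.imread(os.path.join(folder, fname), cv2.IMREAD_GRAYSCALE)
            if img is None:
                continue
            phi = img.ravel().astype(np.float64) - psi
            db.setdefault(syndrome, []).append(E.T @ phi)
    return db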


2.2.4. Diagnosis module

The trained classifiers are employed for prediction in this module. The diagnostic prediction for a patient requires the detection of a frontal face image from a camera or a file, conversion into a grayscale image, and processing by the two image enhancement methods mentioned above, followed by cropping, normalization (Φi = Ii − Ψ), and projection of the normalized vector onto the eigenface space to obtain the weights in a weight vector [w1, w2, w3, ..., wk]. These weights represent the test face as a combination of the k eigenfaces, indicating the proportion of the image contributed by each eigenface on top of the mean image (Ψ). The weight vectors of the input image (a test image in a broader sense) are compared with those of the images classified during training, where the weight vectors are the only variables and the other parameters, such as the eigenfaces and mean vectors, are constant. Thus, a diagnosis can easily be obtained within a few seconds by comparing several features. The comparison is performed against each trained image in the database to identify all similar syndromes that exceed the threshold value supplied by the user. The best matches that exceed the threshold value are found for the syndromes with the minimum distances. These syndromes are the probable diagnoses, which are then displayed to the user with confidence values. These values represent the similarity of the input image to those in the trained set, and they are used to assess the reliability of the proposed diagnostic inference.

Many algorithms are available for comparing the weight vector of the input image to those of the trained images (i.e., for measuring the distance between two points), such as the city block distance, subspace methods, multiple similarity, and the Mahalanobis distance [19]. However, a simple and intuitive approach is to compare an individual face with the other faces in the vector space using nearest-neighbor classification, i.e., the Euclidean distance.4 In image recognition, Euclidean distance comparisons aim to capture how similar or different a test object is from the trained objects in terms of their weight vectors. All the PCs of the test image, represented as the weight vector values [w1, w2, w3, ..., wk], are compared with those of the other images in the dataset, and the mean total distance obtained from each comparison (in terms of the distances between the weight components compared) corresponds to the degree of difference between the two images. The distances between the test image and the other images in the dataset are obtained using Eq. (2), as follows:

D = {D1, D2, ..., D(m−1)},  Dj = (1/k) √( Σ_{n=1}^{k} (Itwn − Itmn)² ),  j = 1, ..., m − 1,   (2)

where D is the column array that contains the distances between the test image and the other (m − 1) images, m is the number of images, Itwn represents the constant weight values in the weight vector of the test image in terms of each comparison with all the other images, Itmn represents the weight values of the other images in the dataset, and k is the number of values in a weight vector.

4 Interested readers may read Calva’s article [20] for more information about Euclidean distance formulations and their implementation for comparing the weight vector of a test image to those of trained labeled images.
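A sketch of this matching step, again with hypothetical names (diagnose(), and the db layout from the training sketch above); the 1/k scaling in Eq. (2) does not affect the ranking and is folded into the normalization described in the next paragraph:

import numpy as np

def diagnose(test_w, db, threshold):
    # db: syndrome name -> list of trained weight vectors;
    # test_w: weight vector of the projected test face.
    scores = []
    for syndrome, vectors in db.items():
        for w in vectors:
            d = np.sqrt(np.sum((test_w - w) ** 2))  # Eq. (2), per image
            scores.append((syndrome, d))
    # Map the distances to [0, 1] and convert to similarities (1 - d),
    # then keep only the matches above the user-supplied threshold.
    dmax = max(d for _, d in scores) or 1.0
    ranked = sorted(((s, 1.0 - d / dmax) for s, d in scores),
                    key=lambda t: t[1], reverse=True)
    return [(s, sim) for s, sim in ranked if sim >= threshold]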


The distance values are mapped between 0 and 1 to quantify them, and these values are then subtracted from 1 so that they express the similarity of the two images rather than their degree of difference. All of these values are placed in a table in which the junction of a column and a row indicates the similarity value between the two corresponding images. This table can be used to evaluate all of the confidence values generated for each comparison. The functions of the implementation stages (Fig. 2) performed during the diagnosis phase are explained in detail in Section 2.4.

2.3. Dataset

We aimed to collect all available frontal face images related to syndromes from previous publications in order to evaluate the method thoroughly and to quantify our results better. Thus, 612 publications with images of patients with syndromes were examined to obtain a good dataset. Specifically, studies of syndromes that included distinctive facial features were our main target because not all syndromes have distinctive facial features. After filtering, 31% of the publications (189) were selected because they described syndromes with facial images that provided clues related to syndromes that had been validated using a genetic test (a gold standard). The quality of some of these images, especially in older publications, was too low in terms of resolution to acquire the essential features. Capturing appropriate frontal face images was not possible in some cases due to the poses in the images, and in some images the eyes were obscured for ethical reasons. Images for which appropriate face views were not available were omitted from the study. The system requires at least two images per syndrome for training, but we aimed to include at least five images to reduce bias and to increase the significance of the results; therefore, syndromes with fewer than five images were also discarded from the study. Ideally, the images used should preserve their original resolution, but only 27% of the images used in the study could be obtained in raw form, whereas the remaining 73% were cropped from the manuscripts.5

The final dataset assembled for this study comprised 15 syndromes with a total of 92 frontal face images, i.e., 5–9 per syndrome, as shown in Fig. 4. These syndromes comprised Mowat-Wilson [21,22], Goldenhar [23], Treacher Collins [24,25], Williams-Beuren [2,26], X-linked mental retardation [27,28], Cardiofaciocutaneous [29–32], Cohen [33–36], Angelman [37–41], Craniofrontonasal [42–44], Crisponi [45], Laron [46–49], Polyhydramnios, Megalencephaly, and Symptomatic Epilepsy (PMSE) [50], Fragile X [2,51], Pitt-Hopkins [52,53], and Potocki-Lupski [54]. The diagnoses of these syndromes had been validated by appropriate genetic tests and confirmed by the authors of the respective publications. The original images can be accessed via the references provided in the present manuscript.

2.4. Experimental design

A screenshot of the implemented method is shown in Fig. 2. Image detection is performed to capture faces from the images in a folder selected by the user, or from a live/video stream of patients, and the application displays the images on the screen during processing. The "detect images from camera" function detects faces in a live video stream of patients who have been diagnosed with the same dysmorphic disease.

5 Note that the Ethical permission and copyright section at the end of this manuscript explains that ethical approval was obtained for our reuse of these images.

The detection process stops and the training process begins when the user clicks the "train captured images" function. The saved data for all the labeled diseases can be used to train the system automatically without any human intervention. The application only asks the user to specify the location of the files for training and the syndrome names for detection in the live/video input. Individual syndromes can be trained with the "train from directory" function, and many syndromes can be trained at the same time with the "train all database" function. Using this function, the system is trained with cases located in specific directories, where the directory names indicate the syndrome names and are accepted as such by the system during automatic training. After training, the system is ready to be used for diagnosis with images or live/video inputs. Several images can be selected in a folder and compared to the labeled trained syndromes stored in the database using the "identify from directory" function to obtain a diagnosis. The functions mentioned above are capable of meeting the expectations of specialists, especially medical geneticists.

In this study, we employed two of these functions to evaluate the methodology: the "train all database" function to train on the syndrome images acquired from previous publications, as mentioned in Section 2.3, and the "identify from directory" function to test the syndrome images. These two functions allowed us to evaluate the dataset via a cross-validation process. The number of PCs in our study was equal to m − 1, where m was the number of cases in the dataset, because the dataset was not very large. The number of PCs should be adjusted if the dataset is larger because high numbers of PCs may introduce noise and incur high calculation costs. The method used to select the top k eigenvectors is described in Section 2.2.2. The noise caused by selecting high numbers of PCs can be observed by examining the eigenfaces, an example of which is shown in Fig. 5. The eigenfaces repeat as the number of PCs increases above a certain level, and this noise can even be distinguished by visual observation.

The dataset used for training was not large, thus we performed a cross-validation of the entire dataset using a leave-one-out scheme and determined the average performance level. The leave-one-out scheme was used to train on the training set and to test the system to determine its accuracy. Leave-one-out cross-validation is simply an n-fold cross-validation, where n is the number of instances in the dataset [55]. Each instance was omitted in turn, and the learning scheme was trained with all the remaining instances. We scored the correctness of each judgment on the omitted instance, i.e., one for success and zero for failure. The results of all n judgments, i.e., one for each member of the dataset, were averaged, and this average represented the final error estimate. This scheme is attractive for two reasons. First, the greatest possible amount of data is used for training in each case, which presumably increases the likelihood that the classifier is accurate. Second, the procedure is deterministic, i.e., no random sampling is involved [55]. Therefore, the accuracy estimate obtained using the leave-one-out scheme is known to be virtually unbiased [56]. Thus, in the leave-one-out cross-validation, we excluded one image as a test image and trained the system using the remaining 91 images. Each separate case was used for testing and the remaining cases were used for training. We repeated this process 92 times, so that every data point was left out as a test sample. The system could build a training set for these syndromes in less than 5 min due to the computational efficiency of PCA. The eigenfaces and the mean image for the dataset are shown in Fig. 5.
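The evaluation loop can be sketched as follows (Python, with scikit-learn's LeaveOneOut used purely for index bookkeeping; in the full system the eigenfaces are retrained for each fold, whereas this simplified sketch scores precomputed weight vectors by nearest-neighbor matching):

import numpy as np
from sklearn.model_selection import LeaveOneOut

def loocv_accuracy(X, y):
    # X: one weight vector per case (92 x k array); y: label array.
    hits = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        # Nearest neighbor among the remaining 91 cases (Euclidean).
        d = np.sqrt(((X[train_idx] - X[test_idx]) ** 2).sum(axis=1))
        pred = y[train_idx][np.argmin(d)]
        hits += int(pred == y[test_idx[0]])
    return hits / len(y)  # one for success, zero for failure, averaged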


Fig. 4. Frontal face images of the 15 syndromes. Each syndrome comprises 5–9 images. From top to bottom: Mowat-Wilson (5M: males), Goldenhar (7F: females, 2M), Treacher Collins (4F, 1M), Williams-Beuren (5M), X-linked mental retardation (7M), Cardiofaciocutaneous (1F, 4M), Cohen (1F, 4M), Angelman (6F, 2M), Craniofrontonasal (5M), Crisponi (5F, 2M), Laron (6F, 2M), PMSE (4F, 1M), Fragile X (6M), Pitt-Hopkins (1F, 4M), and Potocki-Lupski (5F, 2M). Each image is labeled according to its respective row, i.e., the images in the first row are labeled 1a, 1b, 1c, 1d, and 1e, and the images in the last row, 15, are labeled 15a, 15b, 15c, 15d, 15e, 15f, and 15g.6


Fig. 6. Graphical representation of the success rates with the rule-in I, II, and III diagnoses, which yielded 49, 64, and 76 correct diagnoses, respectively, where the corresponding success rates were 53%, 70%, and 83%.

Fig. 5. (a) Eigenfaces of 15 syndromes that comprised 92 frontal faces, (b) mean face ( ).

The application extracted all the confidence values into a table, and the confidence values that exceeded the threshold value were displayed on the screen with the syndrome names. In the case study, we aimed to determine the most probable diagnosis based on the rule-in I, II, and III diagnoses by adjusting the threshold value for each test image. The rule-in I diagnosis corresponded to the most likely diagnosis, rule-in II was the second most likely diagnosis in addition to the first, and rule-in III was the third most likely diagnosis in addition to the two determined by rule-in I and II. The total time required to search all 15 trained syndrome classes to find the most likely syndrome given the threshold value was 3 s per test case. The user could adjust the threshold value to rule diseases in or out during the diagnosis process: with a higher threshold value, fewer diagnoses were suggested, whereas with a lower threshold value, more diagnoses were suggested to the user, together with their confidence values. Thus, the success rates of the rule-in observations were obtained.

3. Experimental results

A table containing the confidence values for all the pairwise comparisons among the syndromes was created with the "identify from directory" function in the diagnosis process. The three most likely diagnoses were selected automatically for the syndromes in this table by adjusting the threshold value in terms of the rule-in I, II, and III diagnoses.
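The rule-in levels are simply cutoffs on the ranked candidate list returned by the diagnosis step; a small sketch (names hypothetical, building on the diagnose() helper above):

def rule_in(ranked, level):
    # ranked: (syndrome, similarity) pairs sorted by decreasing
    # similarity; rule-in I/II/III keep the top 1/2/3 candidates.
    return [s for s, _ in ranked[:level]]

def correct_at(ranked, truth, level):
    # A case counts as correct under rule-in III if the true syndrome
    # appears among the three most likely candidates.
    return truth in rule_in(ranked, level)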

6 Please refer to the Ethical permission and copyright section at the end of the manuscript for copyright information regarding the images in this figure.

An outline of this table is shown in Tables 1 and 2. Table 1 presents the confidence values of the diagnosed syndromes, and Table 2 shows the names of the diagnosed syndromes based on the rule-in I, II, and III diagnoses. For example, the patient labeled 1a matched the patient labeled 1e with a confidence value of 0.787 in terms of the rule-in I diagnosis. This patient, with a known diagnosis of Mowat-Wilson, matched correctly according to rule-in I with 1e, who was also classified as Mowat-Wilson. A geneticist could confirm the diagnosis after applying the first molecular test specified for Mowat-Wilson, thereby avoiding other tests. The second case, labeled 1b, matched the patients labeled 12e, 4a, and 2i with confidence values of 0.681, 0.653, and 0.653, respectively, based on the rule-in I, II, and III diagnoses; this patient, with a known diagnosis of Mowat-Wilson, did not match correctly based on the rule-in I (PMSE), rule-in II (Williams-Beuren), or rule-in III (Goldenhar) diagnoses. The third case, labeled 1c, was diagnosed correctly based on the rule-in II diagnosis, while the fourth case, 1d, was diagnosed correctly based on the rule-in III diagnosis. In addition, for the last Mowat-Wilson case, 1e, the correct diagnosis was identified based on the rule-in I diagnosis.

The rule-in I, II, and III diagnoses yielded 49, 64 (15 more than rule-in I), and 76 (12 more than rule-in II) correct diagnoses, respectively. The cumulative success rates of the system for diagnosing syndromes correctly based on rule-in I (53%), II (70%), and III (83%) are shown in Fig. 6. The success rate for each separate syndrome is also shown in Fig. 7. In particular, all of the frontal faces tested for two syndromes, i.e., Goldenhar (n = 9) and Laron (n = 8), were diagnosed correctly by rule-in I. In addition, all the cases for three syndromes were identified correctly by rule-in II, while all the cases for four syndromes were determined correctly by rule-in III. Thus, it can be concluded that a high number of syndromes with characteristic facial anomaly patterns can be diagnosed using computer-assisted ML algorithms because the face is affected by many of the genes that cause syndromes.

The dataset used for training was not very large, thus a leave-one-out scheme based on the training set was used for training and testing the system to quantify its accuracy. The leave-one-out cross-validation exploited this small dataset to its maximum extent and provided accurate estimates, although this approach is usually infeasible with large datasets due to the high computational cost [55] and the time-consuming work involved.

All 92 cases were also diagnosed manually by five specialists (two pediatricians and three medical geneticists) in terms of the rule-in III diagnosis. The mean syndrome diagnosis success rates of the specialists (average ≈ 50%) are shown in Fig. 8. We compared the results obtained using the proposed method with the diagnostic results of the specialists to determine the significance of the results generated by the proposed method.


Table 1
Similarity values obtained with the rule-in I, II, and III diagnoses, where the highest values are shown for comparison. The gray cells correspond to correct diagnoses. Rule-in I, II, and III yielded 49, 64, and 76 correct diagnoses, respectively, with corresponding success rates of 53%, 70%, and 83%. For example, the first case, 1a, for whom the known diagnosis was Mowat-Wilson, matched correctly based on rule-in I with 1e, which also classified the patient as Mowat-Wilson. The second case, labeled 1b, matched the patients labeled 12e, 4a, and 2i with confidence values of 0.681, 0.653, and 0.653, respectively, based on the rule-in I, II, and III diagnoses; this patient, with a known diagnosis of Mowat-Wilson, did not match correctly based on the rule-in I (PMSE), rule-in II (Williams-Beuren), or rule-in III (Goldenhar) diagnoses. The third case, labeled 1c, was diagnosed correctly based on the rule-in II diagnosis, and the fourth case, 1d, was diagnosed correctly based on the rule-in III diagnosis. In addition, for the last Mowat-Wilson case, 1e, the correct diagnosis was identified based on the rule-in I diagnosis. Please refer to Table 2 for the names of the diagnoses.

[Table 1 body: per-case similarity values for cases 1a–15g under the rule-in I, II, and III rows; the column alignment of the original table could not be recovered from the extracted text.]

Fig. 7. Graphical representation of the syndrome diagnosis success rates obtained using the proposed methodology based on the rule-in I, II, and III diagnoses. The success rates based on rule-in III were as follows: Mowat-Wilson (80%), Goldenhar (100%), Treacher Collins (60%), Williams Beuren (80%), X-linked mental (71.4%), Cardiofaciocutaneous (60%), Cohen (60%), Angelman (75%), Craniofrontonasal (80%), Crisponi (100%), Laron (100%), PMSE (100%), Fragile X (83.3%), Pitt-Hopkins (80%), and Potocki Lupski (85.7%), where the overall average success was 83%.


Table 2
Diagnosed syndrome names based on the rule-in I, II, and III diagnoses, where the highest values are shown for the comparisons depicted in Table 1. The gray cells correspond to the correct diagnoses.

[Table 2 body: per-case diagnosed syndrome names (abbreviated, e.g., Mowat, Golde, Treac, Willi, X-link, Cardi, Angel, Crani, Crisp, Laron, PMSE, Fragi, Pitt-, Potoc) with check marks for correct diagnoses; the column alignment of the original table could not be recovered from the extracted text.]

Fig. 8. Graphical representation of the syndrome diagnosis success rates obtained by five specialists (two pediatricians and three medical geneticists) based on the rule-in I, II, and III diagnoses. The success rates based on rule-in III were as follows: Mowat-Wilson (36%), Goldenhar (64.4%), Treacher Collins (52%), Williams Beuren (32%), X-linked mental (37%), Cardiofaciocutaneous (32%), Cohen (36%), Angelman (50%), Craniofrontonasal (72%), Crisponi (34%), Laron (72.5%), PMSE (36%), Fragile X (73.3%), Pitt-Hopkins (60%), and Potocki Lupski (54%), and the overall average was 50.6%.

A paired t-test was used to evaluate the differences in the results obtained in the cross-validation assessment, as described by Witten [55].

7 Statistical and computational software tool, SPSS, version 17, SPSS Inc., Chicago, Illinois, USA.
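One plausible reading of this comparison, using the per-syndrome rule-in III success rates reported in Figs. 7 and 8 (whether the authors paired the data per syndrome or per case is not stated, so the pairing unit here is an assumption), is sketched below with SciPy:

import numpy as np
from scipy import stats

# Rule-in III success rates per syndrome: proposed method (Fig. 7)
# vs. the specialists' mean rates (Fig. 8).
method = np.array([80, 100, 60, 80, 71.4, 60, 60, 75, 80, 100, 100,
                   100, 83.3, 80, 85.7])
specialists = np.array([36, 64.4, 52, 32, 37, 32, 36, 50, 72, 34, 72.5,
                        36, 73.3, 60, 54])

# Normality check on the paired differences, then the paired t-test.
print(stats.shapiro(method - specialists))
print(stats.ttest_rel(method, specialists))  # p < 0.01, as reported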

A test of normality was performed using SPSS7 via the "Explore" function in the "Analyze-Descriptive Statistics" menu, which showed that the results were normally distributed because the p-value was greater than 0.05 [57]. Thus, a parametric t-test analysis was performed using SPSS to compare the results obtained by the proposed methodology with the results of the specialists. The null hypothesis was "there is no significant difference in the diagnosis success of the method compared with the pre-test diagnosis success of the specialists in terms of the rule-in III diagnoses (μd = 0)." The paired-samples t-test showed that the results obtained by the proposed method differed significantly from the test results (rejecting the null hypothesis), with p < 0.01. This strongly suggests that the results obtained using the proposed method were significantly better than those of the specialists.

Finally, we analyzed the success rates on a syndrome basis. The statistical analysis of the diagnostic tests for each syndrome based on the rule-in I, II, and III diagnoses is shown in Table 3, where all cases with syndromes that differed from the processed syndrome were denoted as "condition negative."

Table 3
Statistical analysis of the diagnostic tests for each syndrome (TP: true positives; FN: false negatives; FP: false positives; TN: true negatives; Se: sensitivity; Sp: specificity; PPV/NPV: positive/negative predictive values; LR: likelihood ratio).

Syndromes | TP | FN | FP | TN | # of cases | Se | Sp | PPV | NPV | Type I (α) | Type II (β) | LR+ | LR− | LR+/LR− | p
Mowat-Wilson | 4 | 1 | 5 | 82 | 92 | 0.800 | 0.943 | 0.444 | 0.988 | 0.057 | 0.200 | 13.920 | 0.212 | 65.600 | 0.0002
Goldenhar | 9 | 0 | 9 | 74 | 92 | 1.000 | 0.892 | 0.500 | 1.000 | 0.108 | 0.000 | 9.222 | 0.000 | NaN | 0.0000
Treacher Collins | 3 | 2 | 1 | 86 | 92 | 0.600 | 0.989 | 0.750 | 0.977 | 0.011 | 0.400 | 52.201 | 0.405 | 129.003 | 0.0003
Williams Beuren | 4 | 1 | 2 | 85 | 92 | 0.800 | 0.977 | 0.667 | 0.988 | 0.023 | 0.200 | 34.799 | 0.205 | 169.996 | 0.0000
X-linked Mental | 5 | 2 | 9 | 76 | 92 | 0.714 | 0.894 | 0.357 | 0.974 | 0.106 | 0.286 | 6.746 | 0.320 | 21.111 | 0.0007
Cardiofaciocutaneous | 3 | 2 | 0 | 87 | 92 | 0.600 | 1.000 | 1.000 | 0.978 | 0.000 | 0.400 | NaN | 0.400 | NaN | 0.0000
Cohen | 3 | 2 | 1 | 86 | 92 | 0.600 | 0.989 | 0.750 | 0.977 | 0.011 | 0.400 | 52.201 | 0.405 | 129.003 | 0.0003
Angelman | 6 | 2 | 6 | 78 | 92 | 0.750 | 0.929 | 0.500 | 0.975 | 0.071 | 0.250 | 10.500 | 0.269 | 39.000 | 0.0000
Craniofrontonasal | 4 | 1 | 7 | 80 | 92 | 0.800 | 0.920 | 0.364 | 0.988 | 0.080 | 0.200 | 9.943 | 0.218 | 45.714 | 0.0005
Crisponi | 7 | 0 | 13 | 72 | 92 | 1.000 | 0.847 | 0.350 | 1.000 | 0.153 | 0.000 | 6.538 | 0.000 | NaN | 0.0000
Laron | 8 | 0 | 2 | 82 | 92 | 1.000 | 0.976 | 0.800 | 1.000 | 0.024 | 0.000 | 41.999 | 0.000 | NaN | 0.0000
PMSE | 5 | 0 | 5 | 82 | 92 | 1.000 | 0.943 | 0.500 | 1.000 | 0.057 | 0.000 | 17.400 | 0.000 | NaN | 0.0000
Fragile X | 5 | 1 | 5 | 81 | 92 | 0.833 | 0.942 | 0.500 | 0.988 | 0.058 | 0.167 | 14.333 | 0.177 | 80.999 | 0.0000
Pitt-Hopkins | 4 | 1 | 5 | 82 | 92 | 0.800 | 0.943 | 0.444 | 0.988 | 0.057 | 0.200 | 13.920 | 0.212 | 65.600 | 0.0002
Potocki Lupski | 6 | 1 | 11 | 74 | 92 | 0.857 | 0.871 | 0.353 | 0.987 | 0.129 | 0.143 | 6.623 | 0.164 | 40.364 | 0.0001
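The measures in Table 3 follow the standard confusion-matrix definitions; the sketch below recomputes them for one row (the Mowat-Wilson counts) as a check:

def diagnostic_measures(tp, fn, fp, tn):
    # Standard confusion-matrix statistics as reported in Table 3.
    se = tp / (tp + fn)                  # sensitivity
    sp = tn / (tn + fp)                  # specificity
    ppv = tp / (tp + fp)                 # positive predictive value
    npv = tn / (tn + fn)                 # negative predictive value
    alpha, beta = 1 - sp, 1 - se         # Type I and Type II error rates
    lr_pos = se / (1 - sp) if sp < 1 else float("nan")
    lr_neg = (1 - se) / sp
    return se, sp, ppv, npv, alpha, beta, lr_pos, lr_neg

# Mowat-Wilson row of Table 3: TP=4, FN=1, FP=5, TN=82.
print(diagnostic_measures(4, 1, 5, 82))
# -> Se 0.800, Sp 0.943, PPV 0.444, NPV 0.988, LR+ 13.9, LR- 0.212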

The sensitivity (Se) and specificity (Sp) values of the overall system were 0.857 and 0.870, respectively. The Se values for three syndromes, i.e., Treacher Collins, Cardiofaciocutaneous, and Cohen, were relatively low, but the Se values of the other 12 syndromes were satisfactory. The Goldenhar, Crisponi, Laron, and PMSE syndromes had Se values of 1.00, which means that all of their positive cases were diagnosed correctly. The Sp values for all syndromes were adequate, but these values might not lead to reliable conclusions because the negative conditions greatly outnumbered the positive conditions, given that all cases with syndromes that differed from the processed syndrome were denoted as "condition negative." Therefore, other measures should also be considered, such as the positive predictive value (PPV), negative predictive value (NPV), Type I error (α), Type II error (β), positive likelihood ratio (LR+), negative likelihood ratio (LR−), and LR+/LR−. Evaluations of these measures are presented in the following.

A strength of the proposed method is its very high NPV (NPV > 0.975): if a negative result is obtained for an individual, there is very high confidence that this negative result is true when the method is used as a negative screening test. Thus, a negative result is very good evidence that a patient does not have the syndrome in question. However, the Mowat-Wilson, X-linked mental, Craniofrontonasal, Crisponi, Pitt-Hopkins, and Potocki Lupski syndromes were confirmed poorly (PPV < 50%), so further investigation is needed before the proposed method can be employed as a positive screening test. The results obtained using the proposed method had low error rates (Type I error (α) and Type II error (β)), especially the α values. Increasing α will reduce β, and vice versa, given a fixed sample size; after α has been set, the only way to decrease β is to increase the sample size. Thus, the relatively high β values for some syndromes, such as Treacher Collins, Cardiofaciocutaneous, and Cohen, suggest that incorporating more cases with these syndromes would yield lower β errors. Furthermore, LRs are useful statistics for summarizing diagnostic accuracy because they have several particularly powerful properties that make them more clinically useful than other statistics [58]: a test is better when LR+ is higher and LR− is smaller, so if LR+ is high and LR− is small, it is probably a good test. It is considered that LR+/LR− values

4. Discussion

The statistical analysis of the performance of the proposed method indicates that the results agreed with the phenotype–genotype correlations reported in previous studies of
