Computers in Biology and Medicine 62 (2015) 294–305


Assessing diagnostic complexity: An image feature-based strategy to reduce annotation costs

Jose R. Zamacona, Ronald Niehaus, Alexander Rasin, Jacob D. Furst, Daniela S. Raicu
School of Computing, DePaul University, Chicago, USA


Abstract

Article history: Received 16 February 2014; accepted 14 January 2015.

Computer-aided diagnosis systems can play an important role in lowering the workload of clinical radiologists and reducing costs by automatically analyzing vast amounts of image data and providing meaningful and timely insights during the decision making process. In this paper, we present strategies for better managing the limited time of clinical radiologists in conjunction with predictive model diagnosis. We first introduce a metric for discriminating between the different categories of diagnostic complexity (such as easy versus hard) encountered when interpreting CT scans. Second, we propose to learn the diagnostic complexity using a classification approach based on low-level image features automatically extracted from pixel data. We then show how this classification can be used to decide how to best allocate additional radiologists to interpret a case based on its diagnosis category. Using a lung nodule image dataset, we determined that, by a simple division of cases into hard and easy to diagnose, the number of interpretations can be distributed to significantly lower the cost with limited loss in prediction accuracy. Furthermore, we show that with just a few low-level image features (18% of the original set) we are able to separate the easy from the hard cases for a significant subset (66%) of the lung nodule image data. © 2015 Elsevier Ltd. All rights reserved.

Keywords: Resource allocation; Computer-aided diagnosis; Image classification

1. Introduction

Clinical radiology is at the center of modern medicine and can significantly impact patient outcomes [1]. Numerous advances in medical imaging technology continuously increase the volume and detail of diagnostic data available to radiologists; the proliferation of cheap storage allows all of this image data to be kept indefinitely. However, human capacity to process collections of images has not kept up with the exponential growth in the available diagnostic data. Computer-aided diagnosis (CADx) systems can be used to assist radiologists in the interpretation of medical images. A CADx system classifies detected regions of interest with respect to their likelihood of malignancy and further helps radiologists decide on the next step, such as biopsy or a short-term follow-up imaging examination. Once the region of interest is detected, a CADx system consists of two main steps: feature extraction and feature-based classification. In the feature extraction step, numerical descriptors quantifying the properties of the regions of interest are calculated and


used to encode the raw pixel data. In the classification step, a machine learning or statistical modeling algorithm uses these image features to train a predictive model on a dataset of annotated training images.

Creating correctly annotated training datasets is a challenging task, especially in the medical imaging domain. When ground truth (such as pathology reports) is not available, the labels/annotations for the training sets are acquired through experts' interpretations. In many instances, the labels obtained from a single expert differ from the ground truth because of uncertainty in the image data, lack of additional information at the time of interpretation, and error [2,3]. For these complex cases, panels of experts are preferred to provide an accurate and reliable label, although it has been shown that the variability among experts' opinions can further introduce uncertainty in the labels [4]. For example, a recent study [5] investigating the association of eye gaze patterns and diagnostic error in mammography showed that there are significant differences even among individuals with the same level of training. Furthermore, the same study reported that the human-perceived complexity of a case was consistently and moderately correlated with the risk of making a diagnostic error. Therefore, in order to obtain the most accurate label while making efficient use of resources/experts, it is important to address how many experts to ask to annotate a medical



image until a label consensus is reached and can be used to train and validate CADx systems.

The first contribution of the work presented in this paper is in the area of efficient resource allocation, i.e., the process of reducing the overall cost of acquiring annotation labels (needed for the classification step of a CAD system) while maintaining comparable quality of image classification. We start from the hypothesis that some of the pathologies captured in medical images will necessarily be more difficult to diagnose (or otherwise label) than others. We therefore assume that if we had a discriminator that could identify the different "categories" of medical cases, we could leverage that information and allocate annotation labels in a more efficient manner. Intuitively, a difficult case will often require multiple opinions (which may still not be enough for certainty), while an easy case may only require one or two evaluations. We define a threshold-based concept of the estimated difficulty of reaching a consensus diagnosis, which guides our custom label acquisition strategy and distributes radiologists' efforts in an efficient, non-uniform manner. Using this concept of case difficulty, we show that additional opinions have a different relative benefit in this context: additional opinions may be more valuable for "hard" cases than they are for "easy" cases, or vice versa in some circumstances. Fig. 1 illustrates the relationship between the three concepts that are core to this paper: consensus (agreement among experts' aggregated opinion and CAD), reliable label (an aggregated label that does not change when a new label is added), and diagnostic complexity (difficulty of reaching a consensus).

The second contribution of the paper is in the area of reducing the semantic gap between low-level image features and high-level interpretations of images. Having introduced a resource allocation strategy that relies on image content, we further evaluate the relationship between the low-level image features and the corresponding difficulty category of the image. This serves two purposes: first, we can use that relationship to automatically predict the relative difficulty of an incoming image from just the low-level image features, without resorting to any radiologist opinions; second, we can provide feedback to radiologists and help identify the most important image features that should be guiding their decision. Knowledge of the anticipated difficulty of achieving a consensus can also be helpful in choosing how much extra time to spend on each case. Using the NCI Lung Image Database Consortium (LIDC) data [6], where diagnosis annotations are provided by expert radiologists, we show that with just a few low-level image features we are able to separate the easy from the hard cases for a significant subset of the lung nodule image data.

[Fig. 1: aggregated label versus number of experts for three example cases; legend: reliable label and low complexity level (triangle), reliable label and high complexity level (circle), non-reliable label and high complexity level (square).]

Fig. 1. Example of three cases that present different levels of diagnostic complexity and label reliability. The triangle- and circle-marked lines indicate the acquisition of a reliable label, because the aggregated label does not change after a certain number of experts have provided labels. The level of complexity is higher for the circle-marked line because label consistency is reached only after more experts are asked to provide a label. The square-marked line indicates a case with high complexity where, given a certain number of radiologists, a reliable label still cannot be reached.


While we present our methodology and results for a particular medical application (lung nodule interpretation in Computed Tomography (CT) images) and diagnosis, the same principles apply to any dataset and medical task where annotations by multiple experts are required. The significance of relating image content to human perception and cognition was also recently shown in [7], where certain image features were found to be important in predicting diagnostic error for mass interpretation in mammograms. Furthermore, segmentation of lesions can also benefit from the proposed approach, given that in many cases the ground truth is derived only from experts' delineated boundaries [8]. Our approach can help determine how many outlines to aggregate (for example, using a p-map approach as defined in [9]) until a reliable region of interest is identified for the lesion. To the best of our knowledge, the presented work pioneers the idea of resource allocation for diagnostic imaging and shows strong support for the role of low-level image features in assessing diagnostic complexity.

The rest of this paper is organized as follows. Section 2 discusses related research and background in the area of resource allocation and computer-aided diagnosis for lung nodules. In Section 3, we introduce our main dataset, the LIDC, discuss our methodology, and define the image "difficulty" rating. Sections 4 and 5 present experimental results, and Section 6 concludes our findings and sketches further planned work.

2. Background and related work

2.1. Resource allocation for image annotation

Classification mechanisms consider a collection of available item features and, based on the already known item labels, construct a model that can predict the labels of unknown items. Therefore, resource allocation approaches may focus on reducing the cost of feature acquisition or label acquisition, and the affected utility function is the resulting loss in prediction accuracy. Item features do not generally have any uncertainty associated with them. In contrast, only the simplest annotation schemes have no uncertainty at all; if available, the label of any item is known precisely. This may be accomplished by using a single annotator (when labels cannot be wrong) or simulated by requiring a single consensus-based label determined by a panel of annotators (i.e., through voting). Uncertainty in labeling can be introduced by noise, in which the inherent truth of a label is obscured by problems in the labeling process, or by imprecise knowledge of the label itself, in which there is no inherently true label, but the possibility of many labels.

In recent years, the emerging popularity of crowdsourced data acquisition has created many situations where label quality is suspect (anonymous non-expert annotators). For example, crowdsourcing services like Amazon's Mechanical Turk [10], Games with a Purpose [11], and reCAPTCHA [12] provide inexpensive ways to acquire labels from a large number of annotators in a short time. Furthermore, websites such as Galaxy Zoo [13] and LabelMe [14] allow the public to label astronomical images and general-purpose images, respectively, over the Internet.

The bulk of the available work has relied on opinion consensus when working with multiple labels. For example, the authors of [15] investigated the benefits of combining labels from multiple sources to improve data quality and achieve consensus. Their work assumes that the ground truth exists and that each annotator has a consistent noise level (e.g., annotator X is correct 75% of the time), which is not the case in our setting. They concluded that repeated labeling can improve the overall data quality, even when using noisy labelers, and that the naïve round-robin approach for choosing what to label next is rarely the best strategy. A large number of non-expert volunteers were shown to achieve useful



consensus results in [16], although these results are in a domain where a non-expert can produce good annotations (lunar crater annotation in that particular example), and ultimately such non-expert analysis serves as a pre-processing step to optimize experts' time. The work in [16] mentioned, but did not investigate, the possibility of incorporating expert evaluations into the same model, thereby introducing a multi-tiered cost of label acquisition.

The work in [17] directly considers the cost of acquiring features rather than labels; it assumes that true labels are already known, but that collecting additional item features may be costly. The work in [17] relies on the assumption that each feature has an inherent (independent) predictive value which can be weighed against its respective cost. In [18], studies were performed to analyze the cost of feature acquisition and its implications for the resulting classifier accuracy, concluding that round-robin acquisition is not an efficient approach. The authors of [18] model the expected benefit of acquiring additional features, which is similar to the first phase of our work (though we look at additional labels and not features). However, in the second phase we expand our work to estimate the expected benefit automatically, without looking at the labels.

The work in [19] discusses the trade-offs between acquiring annotations on images and information gain. The authors consider a more general case of labeling image contents, where both annotated and partially annotated images must be considered. Ground truth exists, but different labeling categories are available (e.g., "list items present in the image", "estimate how difficult this image is to annotate"). The authors of [20] found that a non-expert can estimate the image annotation difficulty, but it is important to note that difficulty in that context is "how long does it take to label the image?", which is very easy to observe in practice. Our goal is to identify difficulty as it relates to experts' ability to arrive at an agreement; our difficulty measure had to be defined before it could be evaluated. Incorporating the cost of annotation acquisition is similar to the work in [21], which controls the cost of label acquisition by limiting how much time the annotator is allowed to spend on the labeling process (by showing only a fraction of the text that needs to be annotated). In the long run, the authors of [21] also intend to consider the incremental benefit of spending more resources on label acquisition to reduce the learning cost. The work in [22] focuses on Random Forests and on efficiently using a pool of individual predictive models for ensemble classification. It is similar to our work in that it investigates the cost of arriving at a consensus, but it once again assumes the presence of ground truth and relies on a large pool of 100 trees to converge to an answer. In the same vein, the work in [20] studied strategies for applying crowdsourcing techniques (using tools such as LabelMe [14]) to improve image annotation, in order to reduce the manual interaction required from specialists. Their goal is to improve non-expert labeling output by supplying detailed instructions and providing expert feedback. In [23], the authors use a semi-supervised learning algorithm variation called Co-Forest that relies on a combination of labeled and unlabeled items to reduce the prediction errors of their models.
Their approach relies on the idea that two classifiers can exchange newly predicted labels to mutually improve classification, as long as only the highest-confidence labels are shared.

2.2. Consensus, truth estimation, and computer-aided diagnosis for lung nodules

The approach presented in this paper pertains to a particular category of images, where no ground truth is available and annotations have to be provided by experts. Such a situation is common in the medical domain where, before pathology reports (forming the ground truth) become available, radiologists have to interpret

medical images with respect to their likelihood of malignancy (generating a reference truth). For diagnostically difficult cases, several opinions might be necessary, and therefore additional experts need to be assigned to those cases. Our proposed approach aims to perform annotation assignment so as to minimize the effort required from the radiologists while losing as little accuracy as possible (or, alternatively, given a fixed budget, to build a model with the highest feasible accuracy). The approach is similar to the challenges of efficient resource allocation discussed in [24] and in several of the approaches discussed in this section. However, in those settings a second opinion is usually not necessary, since the label provided by an expert is treated as the ground truth. When multiple opinions are used, it is to estimate the underlying ground truth value based on unreliable (i.e., non-expert) annotators; in contrast, we have to use the current expert consensus as a substitute for ground truth, and that consensus can change at any time (e.g., suppose the first five medical experts agree on a diagnosis and then six more disagree, changing the overall consensus). Consequently, our definition of "difficulty" is fundamentally different from related work because it refers to the ease of achieving consensus and not to the difficulty of acquiring the label. One difference between these two measures is that the difficulty of acquiring the label is defined by objective quantifiers (e.g., time to acquire the label, salary paid to the agent acquiring the label), while consensus difficulty cannot be quantified so easily. There is certainly some correlation between the two measures, and if the LIDC provided the time taken for each diagnosis, we could have used it as a valuable predictor variable (it would also make the resource allocation phase of our work more precise). Some of the more expensive-to-acquire labels will be difficult to agree on, while some of the cheaper labels will correspond to easy consensus. However, it is the other two combinations that would be most interesting to study: the easy-to-acquire labels which result in disagreement between radiologists, and the difficult-to-acquire labels that quickly coalesce into a consensus.

Consensus interpretation of imaging studies is defined as the agreement reached when two or more radiologists report the imaging findings [25]. In the computer-aided diagnosis (CAD) literature, consensus image interpretation is used as a standard of reference providing the target class label against which the CAD method is compared [26]. For the diagnosis of lung nodules, the application domain of this paper, most CAD systems use traditional classification techniques such as linear discriminant analysis [27–33], decision trees [34,35], and neural networks [36–39] to learn the class label from nodules' appearance, size, and shape image features. CAD performance is generally evaluated using receiver operating characteristic (ROC) analysis, and the area under the ROC curve is used as a performance index [40,41]. A match between the diagnosis predicted by CAD and the standard reference truth for every case in the data results in an area under the curve equal to 1. While all of the above CAD studies relate to the prediction of malignancy, there are other studies that look into predicting specific characteristics of lung nodules that are important in the diagnosis process.
For example, [42] proposed a patch-based context analysis to differentiate between well-circumscribed, vascularized, juxta-pleural, and pleural-tail types of nodules. Further, [43] classified lung nodules into round, lobulated, densely spiculated, ragged, and halo based on their margin characteristics. In [44,45], Raicu et al. used image features to predict the spiculation, lobulation, margin, subtlety, sphericity, and texture characteristics of lung nodules. An extensive literature review of CAD systems for lung nodules is presented in [46]. All of these studies have in common the assumption that a good class label is generated either by a single annotator or by the consensus among multiple annotators. Given that in the clinical practice of radiology


consensus is hard to achieve, it is important to be able to determine how many annotators to use until a reliable and accurate label is obtained for each case. In the next section, using the NIH Lung Image Database Consortium Data, we propose a strategy for determining the level of difficulty to reach a consensus and show that image features can be used to determine the level of consensus difficulty.


3. Methodology

3.1. Lung Image Database Consortium (LIDC)

The most recent Computer-Aided Diagnosis (CAD) findings support and extend the need for creating reference standard data sets that can provide the ground truth for building and validating CAD systems. One such dataset is the Lung Image Database Consortium (LIDC) [6] – a diverse and growing collection of Computed Tomography (CT) scans analyzed by four radiologists. Each radiologist provided a contour for the nodule or nodules present in the scan, as well as a set of characteristics for the nodule as a whole (cross sections of the same nodule are generally present on multiple CT slices). These characteristics are lobulation, malignancy, margin, sphericity, spiculation, subtlety, and texture; each received a rating on a scale from one to five. Two other characteristics are internal structure and calcification, with categorical values between 1 and 4 and between 1 and 6, respectively. While the LIDC provides a common framework for training and evaluating CAD algorithms, the LIDC data presents several challenges, including the lack of ground truth and the variability among multiple observers, as there was no forced consensus among radiologists when assigning ratings for each characteristic (Fig. 2). Furthermore, the number of nodules on which there was agreement among radiologists was small. These challenges presented by the LIDC data open new avenues for applying non-traditional machine learning approaches to the medical imaging decision process, and the approaches proposed in this paper augur well for the future of CAD.

The latest release of the LIDC dataset contained 2669 distinct nodules from 1018 patients. For the purpose of this research, we used the subset of the LIDC data set where all four radiologists marked that a nodule was present in the scan, and we used the slice that contained the largest area of the nodule among all the areas delineated by each radiologist on the corresponding CT scan. This led to an 810-nodule data set from 309 patients.

Fig. 2. Example of four different delineations on a slice marked by four different radiologists.

3.2. Feature extraction

3.2.1. Low-level image features

The low-level image features used in this study are based on our previous work [47] on predicting the likelihood of malignancy for lung nodules using the LIDC dataset. We extract a total of 64 low-level image features divided into four feature categories: shape, size, intensity, and texture.

3.2.1.1. Shape features

We use eight common image shape features [47]: circularity, roughness, elongation, compactness, eccentricity, solidity, extent, and the standard deviation of the radial distance. Circularity is measured by dividing the circumference of the equivalent-area circle by the actual perimeter of the nodule. Roughness is measured by dividing the perimeter of the region by the convex perimeter; a smooth convex object, such as a perfect circle, has a roughness of 1.0. The eccentricity is obtained using the ellipse that has the same second moments as the region: it is the ratio of the distance between the foci of the ellipse and its major axis length, with values between 0 (a perfect circle) and 1 (a line). Solidity is defined in terms of the convex hull corresponding to the region, being the proportion of the pixels in the convex hull that are also in the region. Extent is the proportion of the pixels in the bounding box (the smallest rectangle containing the region) that are also in the region. Finally, the RadialDistanceSD is the standard deviation of the distances from every boundary pixel to the centroid of the region.
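To make these definitions concrete, the following is a minimal sketch (our own illustration, not the authors' code) of how several of the shape descriptors above could be computed from a binary nodule mask; the use of scikit-image's regionprops and the helper name are assumptions.

```python
import numpy as np
from skimage import measure

def shape_features(mask):
    """Sketch: compute a few of the shape descriptors described above
    from a single-nodule binary mask (2-D boolean array)."""
    props = measure.regionprops(mask.astype(int))[0]
    area, perimeter = props.area, props.perimeter
    # Circularity: circumference of the equal-area circle divided by the actual perimeter.
    circularity = 2.0 * np.sqrt(np.pi * area) / perimeter
    # Roughness: region perimeter divided by the convex-hull perimeter (1.0 for a smooth convex shape).
    convex_perimeter = measure.regionprops(props.convex_image.astype(int))[0].perimeter
    roughness = perimeter / convex_perimeter
    # RadialDistanceSD: standard deviation of the boundary-to-centroid distances.
    boundary = measure.find_contours(mask.astype(float), 0.5)[0]
    radial = np.linalg.norm(boundary - np.array(props.centroid), axis=1)
    return {
        "circularity": circularity,
        "roughness": roughness,
        "eccentricity": props.eccentricity,  # 0 = perfect circle, 1 = line
        "solidity": props.solidity,          # region pixels / convex-hull pixels
        "extent": props.extent,              # region pixels / bounding-box pixels
        "RadialDistanceSD": radial.std(),
    }
```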

3.2.1.2. Size features

We use the following seven features to quantify the size of the nodules [47]: area, ConvexArea, perimeter, ConvexPerimeter, EquivDiameter, MajorAxisLength, and MinorAxisLength. The area and perimeter features measure the actual number of pixels in the region and on the boundary, respectively. The ConvexArea and ConvexPerimeter measure the number of pixels in the convex hull and on the boundary of the convex hull corresponding to the nodule region. EquivDiameter is the diameter of a circle with the same area as the region. Lastly, the MajorAxisLength and MinorAxisLength give the lengths (in pixels) of the major and minor axes of the ellipse that has the same normalized second central moments as the region.

3.2.1.3. Intensity features

The nine gray-level intensity features used in this study are simply the minimum, maximum, mean, and standard deviation of the gray-level intensity of every pixel in each segmented nodule image, and the same four values for every background pixel in the bounding box containing each segmented nodule image. Another feature, IntensityDifference, is the absolute value of the difference between the mean gray-level intensity of the segmented nodule image and the mean gray-level intensity of its background [47].

3.2.1.4. Texture features

Texture analysis methods can typically be grouped into four categories [47]: model-based, statistical-based, structural-based, and transform-based. Structural approaches seek to understand the hierarchical structure of the image, while statistical methods describe the image using purely numerical analysis of pixel intensity values. Transform approaches generally perform some kind of modification of the image, obtaining a new "response" image that is then analyzed as a representative proxy for the original image, and model-based



methods are based on the concept of predicting pixel values from a mathematical model.

Based on previous work, we focus on three well-known texture analysis techniques: co-occurrence matrices (a statistical-based method), Gabor filters (a transform-based method), and Markov Random Fields (a model-based method). Co-occurrence matrices focus on the distributions and relationships of the gray-level intensities of pixels in the image. They are calculated along four directions (0°, 45°, 90°, and 135°) and five distances (1, 2, 3, 4, and 5 pixels), producing 20 co-occurrence matrices. Once the co-occurrence matrices are calculated, eleven Haralick texture descriptors are computed from each co-occurrence matrix. Although each Haralick texture descriptor is calculated from each co-occurrence matrix, we averaged the features across all distance/direction pairs, resulting in 11 (instead of 11 × 4 × 5) Haralick features per image.

Gabor filtering is a transform-based method which extracts texture information from an image in the form of a response image. A Gabor filter is a sinusoid modulated by a Gaussian and discretized over orientation and frequency. We convolve the image with 12 Gabor filters: four orientations (0°, 45°, 90°, and 135°) and three frequencies (0.3, 0.4, and 0.5), where frequency is the inverse of wavelength. We then calculate the means and standard deviations of the 12 response images, resulting in 24 Gabor features per image (a sketch of this computation is given at the end of this subsection).

The Markov Random Field (MRF) is a model-based method which captures the local contextual information of an image. We calculate five features corresponding to four orientations (0°, 45°, 90°, 135°) along with the variance. We calculate feature vectors for each pixel by using a 9 estimation window. The means of the four different response images and the variance response image are used as our five MRF features.
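As an illustration of the Gabor feature computation described above, here is a minimal sketch using scikit-image (our own code; taking the magnitude of the complex filter response as the "response image" is an assumption, since the text does not specify it):

```python
import numpy as np
from skimage.filters import gabor

def gabor_features(image):
    """Sketch: 4 orientations x 3 frequencies = 12 Gabor response images;
    keep the mean and standard deviation of each, giving 24 features."""
    features = []
    for theta in np.deg2rad([0, 45, 90, 135]):
        for frequency in (0.3, 0.4, 0.5):
            real, imag = gabor(image, frequency=frequency, theta=theta)
            response = np.hypot(real, imag)  # magnitude of the complex filter response
            features.extend([response.mean(), response.std()])
    return np.asarray(features)
```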

3.2.2. Image annotations

The target label was set to the malignancy characteristic of the nodule. There are five possible malignancy ratings (r = 1, ..., 5) that each radiologist can assign to a nodule: 1 = "Highly Unlikely", 2 = "Moderately Unlikely", 3 = "Indeterminate", 4 = "Moderately Suspicious", 5 = "Highly Suspicious". Since each nodule is annotated by four radiologists, we can investigate how going from one label to four labels per nodule benefits the label and impacts the performance of the CAD system. To simulate the incremental acquisition of labels, we first consider only one rating from any of the radiologists, then we add another rating that was not considered already, and so on until all radiologists' ratings are used. The label at each iteration is created by taking the mode of the selected ratings (or their average if the mode does not exist). These labels are then used to create four classification models M_1, M_2, M_3, M_4. The ratings are not fixed; for instance, M_1 does not correspond to any particular single rating, but to a randomly selected single rating. In the general case with k models (k being the number of annotators, k ≥ 2), the set of annotations (ratings) for an instance I is denoted by

$$l^I = \left\{ l^I_{A_1}, l^I_{A_2}, \ldots, l^I_{A_k} \right\}, \qquad (1)$$

and each annotation $l^I_{A_i}$ (i = 1, ..., k) is a malignancy rating with values 1 to 5 representing the degree of malignancy as explained above. The set of annotations for each instance I generates the instance label $l^I_{M_p}$ for model M_p with p = 2, ..., k:

$$l^I_{M_p} = \operatorname{mode}\left( l^I_{M_1}, l^I_{M_2}, \ldots, l^I_{M_{p-1}},\; \operatorname{rand}\left( l^I \setminus \left\{ l^I_{M_1}, l^I_{M_2}, \ldots, l^I_{M_{p-1}} \right\} \right) \right) \qquad (2)$$

For example, in the case of the LIDC dataset, a nodule instance that was annotated by four radiologists with (1, 1, 2, 3) can generate the following sequence of randomly selected ratings: (2), (2, 1), (2, 1, 3), and (2, 1, 3, 1), to be used by M_1, M_2, M_3, and M_4, respectively.
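A minimal Python sketch of this label-generation procedure follows (our own illustration; breaking ties by averaging when no unique mode exists is our reading of "average if the mode does not exist"):

```python
import random
from collections import Counter

def aggregate(ratings):
    """Mode of the selected ratings, or their average when no unique mode exists."""
    counts = Counter(ratings).most_common()
    top = [r for r, c in counts if c == counts[0][1]]
    return float(top[0]) if len(top) == 1 else sum(ratings) / len(ratings)

def simulate_incremental_labels(all_ratings, rng=random):
    """Draw the available radiologist ratings one at a time without replacement
    and return the aggregated label after each acquisition (labels for M1..Mk)."""
    pool = list(all_ratings)
    rng.shuffle(pool)
    labels, selected = [], []
    for rating in pool:
        selected.append(rating)
        labels.append(aggregate(selected))
    return labels

# Example: a nodule rated (1, 1, 2, 3) by four radiologists.
print(simulate_incremental_labels([1, 1, 2, 3]))
```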

The next section discusses how the proposed classification approach uses the low-level image features and these labels, with the ultimate goal of distinguishing between easy and hard to diagnose cases.

3.3. Classification models

To build the classification models, we used the decision tree approach, given its simplicity and lack of assumptions about the distribution of the data. A further reason to apply decision-tree-based models is their ability to select the top-level attributes with the highest differential impact. The top-level nodes in the decision tree are the most significant attributes that influence the outcome, and we require this ability in order to provide intuitive feedback to the radiologists by recommending which features are most helpful in making a decision. In particular, we used the Classification and Regression Tree (CRT/CART) algorithm introduced by Breiman [21]. The CART algorithm accepts numerical or categorical predictors and target variables and creates binary splits as it grows, seeking the purest splits according to an impurity measure. There are various measures on which to split the tree, depending on the type of target variable: if the target variable is categorical, the Gini, Twoing, and Ordered Twoing impurity measures can be used; if the target variable is numerical, the Least Squares Deviation (LSD) can be used. The growth of the tree is limited by choosing a maximum depth and by setting the parent and child node sizes that limit the splitting allowed.

In this work, we present two categories of prediction models: (1) malignancy-prediction models that predict combinations of radiologists' ratings based on image features and are the basis for determining the consensus difficulty level, and (2) difficulty-prediction models that predict the difficulty level determined in (1) by using only low-level image features.

3.3.1. Malignancy-prediction models

Since the LIDC dataset provides labels by up to four radiologists and we are considering the nodules that received four labels, we propose to create four different predictive models by keeping the same low-level image features but changing the class label. As discussed earlier, a similar methodology will apply when there are more than four radiologists/annotators. The goal of this modeling process is to simulate the incremental acquisition of labels. The LIDC is a static dataset, and therefore we are unable to acquire fresh labels during algorithm execution. Since one of our goals is measuring the value of each additional radiologist opinion, we choose to simulate the incremental process by sampling from the available radiologist opinions without replacement. The four models described below approximate a 1-, 2-, 3- and 4-radiologist evaluation, and these models are used to measure the value of the 2nd, 3rd and 4th opinion.

The first decision tree model M_1 is built using label $l^I_{M_1}$, which takes into account only one random radiologist malignancy rating. The second model M_2 is built using label $l^I_{M_2}$, which takes into account two random radiologist malignancy ratings. Further, the third and fourth models use labels $l^I_{M_3}$ and $l^I_{M_4}$, generated similarly using formula (2). The ratings were also randomized 20 times to create different trials in which the classification models select ratings in different sequences in order to build the four different models.
Note that only M_1, M_2, and M_3 are randomized (by selecting one, two, and three ratings out of four); we only have four ratings per image, and thus M_4 cannot be randomized.

The classifier models built in our experiments and trials were evaluated based on their classification accuracy on the testing sets.


To obtain reliable estimates, a stratified sampling validation was used, with 66% of the data for training and 34% for testing in all experiments. The higher the accuracy, the better the model did in classifying the malignancy of the nodules. To correctly classify a case, the classifier model M_p had to match the malignancy label predicted from the low-level features, $\operatorname{predicted}(l^I_{M_p})$, with the consensus of the radiologists' malignancy ratings, $l^I_{M_p}$.

3.3.2. Distinguishing an easy case from a hard case

The focus of the first phase of our experiments is to demonstrate the disparity between easy- and hard-consensus cases and to show how such a separation can be leveraged to assign annotators in an efficient manner. We define a threshold-based strategy that allows us to draw the distinction between difficult and easy cases, and we report the resulting performance for a range of thresholds. To avoid introducing too much uncertainty into our results, we generate results using the binary easy/hard classification and do not use a dynamic classifier for identifying hard or easy cases (which would cause some of the easy cases to be re-classified as hard, or vice versa, based on new information). A more fine-grained sample partition (e.g., easy, medium, hard) is beyond the scope of this paper; in fact, the cases that switch affiliation between the easy and hard categories as the threshold value changes are the obvious candidates for a mid-range difficulty category.

In order to find a cost-effective solution for the number of raters required for each case, we need to differentiate between what constitutes an easy or a hard case for the purposes of rating. We propose to differentiate easy and hard cases using a threshold value based on the distribution of the case error variance across all classification models:

$$\operatorname{error}(M_p) = \left|\, l^I_{M_p} - \operatorname{predicted}(l^I_{M_p}) \,\right| \qquad (3)$$

$$\operatorname{var}(I) = \frac{1}{k-1} \sum_{p=1}^{k} \left( \operatorname{error}(M_p) - \operatorname{mean}(\operatorname{error}) \right)^2 \qquad (4)$$
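A minimal sketch of this computation follows (our own illustration; reading var(I) as the sample variance of the per-model errors, with the k − 1 denominator, is an assumption that is consistent with the worked example below):

```python
import numpy as np

def case_difficulty(consensus_labels, predicted_labels, threshold=0.0):
    """Eq. (3)-(4) sketch: per-model absolute errors, their variance across the
    k models, and the resulting easy/hard assignment for a given threshold."""
    errors = np.abs(np.asarray(consensus_labels, float) - np.asarray(predicted_labels, float))
    variance = errors.var(ddof=1)  # sample variance across the k models
    return ("easy" if variance <= threshold else "hard"), variance

# All four models off by 1 -> variance 0 -> easy at threshold 0.
print(case_difficulty([3, 3, 3, 3], [4, 4, 4, 4]))
# Three models correct, one off by 1 -> variance 0.25 -> hard at threshold 0.
print(case_difficulty([3, 3, 3, 3], [3, 3, 3, 4]))
```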

For example, if all four models were off by 1, then the variance of that case would be 0. If three models correctly classify a nodule but one model misses by 1, the variance would be 0.25. For a threshold equal to zero, the first example would be assigned to the easy category while the second one would be assigned to the hard category. Fig. 3 provides a visual overview of our easy versus hard modeling approach.

3.3.3. Reducing classification costs

Once the cases have been separated into the easy and hard categories, we can leverage that information to reduce the cost of annotation acquisition. Recall that in this setting (no ground truth is available), there is no limit to how many labels can be used for every instance. Although in our experiments we simulate cost


reduction by using a subset of the available labels (4 per instance in our data set), in practice the saved resources would be applied to other instances or used to improve accuracy by acquiring more than 4 labels.

The principle behind determining the cost-efficient way of distributing labels is the estimated marginal utility of a label. Instead of focusing on the absolute accuracy of a particular model, we focus our attention on the relative increase in accuracy that results from adding a new label. We approximate the increase in accuracy for each model within each case category and use that information to decide which labels have more value than others. For example, if we observe that model M_1 accuracy for hard cases is 50% and model M_2 accuracy is 56%, then the estimated marginal utility of the second label for hard cases is 6% (56 − 50). Alternatively, if model M_1 accuracy is 70% for easy cases and model M_2 accuracy for easy cases is 71%, then the marginal utility is 1% (71 − 70) when dealing with easy cases. Note that the absolute accuracy of prediction for easy cases may be higher, yet the marginal benefit of the second label for hard cases is actually more valuable (a 6% gain versus a 1% gain, regardless of the absolute values).
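The sketch below illustrates such a greedy allocation rule with hypothetical marginal gains (our own simplification: it allocates extra labels per category and ignores category sizes):

```python
def allocate_extra_labels(marginal_gains, budget):
    """Each category starts with one label per case; repeatedly give the next
    label to the category whose next label has the highest estimated gain."""
    labels = {category: 1 for category in marginal_gains}

    def next_gain(category):
        idx = labels[category] - 1  # gain of acquiring label number labels[category] + 1
        return marginal_gains[category][idx] if idx < len(marginal_gains[category]) else float("-inf")

    for _ in range(budget):
        best = max(marginal_gains, key=next_gain)
        if next_gain(best) <= 0:
            break
        labels[best] += 1
    return labels

# Hypothetical marginal accuracy gains (%) of the 2nd, 3rd and 4th label per category.
gains = {"hard": [6.0, 1.0, 0.5], "easy": [1.0, 0.5, 0.2]}
print(allocate_extra_labels(gains, budget=3))  # -> {'hard': 3, 'easy': 2}
```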

3.3.4. Difficulty-prediction models

Once each case has a difficulty-based label associated with it, we investigate the relevance of low-level image features in classifying the two levels of diagnostic complexity: easy versus hard to converge to a consensus. If radiologists knew which cases are more ambiguous (difficult to agree on), they would know which cases require extra attention. Since we are interested in determining the importance of a type of feature rather than an individual feature, we aggregate the twenty-four Gabor texture features calculated across different directions and frequencies into 2 averaged Gabor features (one for direction and one for frequency) and build models based on 42 image features (instead of the original 64 image features).

Furthermore, given that the threshold-based approach produces an imbalanced dataset of easy and hard cases, we consider two data configurations when building the difficulty-prediction models. One is the unbalanced configuration as generated directly by the threshold-based approach, and the other is the balanced dataset, where stratified sampling is applied to the unbalanced dataset to obtain an equal number of cases from each level of difficulty. The feature importance is calculated using the Gini index for each of the two configurations as follows: the predictor importance associated with a particular node split is computed as the difference between the impurity of the parent node and the total impurity of the two children. This value is averaged over all nodes that use the predictor. Decision trees were also built to predict the level of difficulty by using just the most important features, and their performance was

Fig. 3. Case difficulty definition based on annotator agreement variance.



compared with the one obtained when using all image features. To obtain reliable estimates given the smaller balanced dataset, cross-validation was performed: 80% of the data was used for training and 20% was set aside for testing.
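A minimal scikit-learn sketch of this procedure is shown below (our own illustration with synthetic placeholder data standing in for the 42 image features and the easy/hard labels; the parameter grid values are assumptions):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: X = 42 low-level image features per case, y = 0 (easy) / 1 (hard).
rng = np.random.default_rng(0)
X = rng.normal(size=(810, 42))
y = rng.integers(0, 2, size=810)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Pick the minimum-cases-per-parent/leaf parameters by 10-fold cross-validation on the 80% split.
search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=0),
    {"min_samples_split": [18, 24, 28], "min_samples_leaf": [9, 12, 14]},
    cv=10,
)
search.fit(X_train, y_train)
tree = search.best_estimator_

print("testing accuracy:", tree.score(X_test, y_test))
# Gini-based predictor importance (impurity decrease attributed to each feature).
print("most important features:", np.argsort(tree.feature_importances_)[::-1][:5])
```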

4. Malignancy-prediction results

4.1. Determining easy versus hard cases

To create the classifiers, we divided the cases into training/testing subsets using stratified sampling and conducted 20 random independent trials, building each of the four different models M_1, M_2, M_3, and M_4. When creating the decision trees, the growth limit of the tree was set to a depth of 12, which provided enough depth for all of the trees in the experiments to reach their full growth. The impurity measure and the minimum numbers of cases per parent and child node were varied, and the values that produced the best accuracies on testing were selected. The average accuracy of the classifier and the corresponding confidence intervals across the 20 trials are presented in Table 1 and Fig. 4, respectively. We conducted a compare-means analysis at a significance level of 0.05 to record whether there was any statistically significant difference between the average results across the four models. In the testing set, we found incremental accuracy improvements as the number of radiologist ratings increased from one to four, with no changes to the classifier configurations and the same image features. As one would expect, the marginal benefit of each additional label acquisition decreases as we move from model M_1 to model M_4. Easy cases become settled quickly, and hard cases are difficult to agree on (if it is even possible to come to a consensus).

Once a threshold is chosen as described in Section 3.3.2, the cases whose variance is lower than or equal to the threshold are considered "easy" and cases with higher variance are ranked as "hard". The distribution of the easy versus hard cases for three different thresholds (0, 0.25, and 0.33) using the model-prediction variance criterion is shown in Table 2; in particular, it presents the case distribution from trial 6, which was closest to the average results of the 20 trials that were run (all of the other trials show a similar distribution). We do not report results for the split at a threshold value of 0.67 or at higher thresholds because too few cases are rated as "hard" – with only 91 cases out of 810 rated as "hard", the predictive model input would not be sufficiently balanced. In Section 5 we include experimental results with sampled input, because even at lower thresholds the data split is not evenly balanced. As our results in that section show, none of the thresholds that we consider is "optimal", as each has some advantage at different stages of the resource allocation process.

The same analysis that was performed on the 810 cases was then repeated for the easy case sets (285 cases for the threshold of 0, 556 cases for the threshold of 0.25, and 666 cases for the threshold of 0.33) and for the hard case sets (525 cases, 254 cases, and 144 cases, respectively – the total always remains 810). The results obtained with 20 random independent trials show that the decision tree classifiers are statistically significantly better on the easy cases (Table 3) than on the hard cases (Table 4).

Table 1
Average classifier accuracy (20 trials) for the entire data set (no separation into easy versus hard cases).

                    M_1     M_2     M_3     M_4
Training results    57.4    63.2*   66.3*   65.6
Testing results     44.1    53.5*   56.5*   57.4

* Accuracy statistically significant from the previous model.

Fig. 4. Classifier accuracy results for the 20 trials across 4 models for the entire data set.

Table 2
CRT tree parameters used to report the accuracies from Table 1, and the number of cases based on the threshold value defined in Section 3.3.2.

Case assignment            Cases   Min parent node   Min child node
All cases                  810     28                14
Random split      Easy     405     18                9
                  Hard     405     18                9
Threshold = 0     Easy     285     16                8
                  Hard     525     20                10
Threshold = 0.25  Easy     556     28                14
                  Hard     254     16                8
Threshold = 0.33  Easy     666     28                14
                  Hard     144     18                9

Table 3
Testing accuracy values (average of 20 trials) across 4 models for easy cases.

                  M_1     M_2     M_3     M_4
Threshold 0       56.5    67.4*   72.7*   74.9*
Threshold 0.25    50.2    61.0*   62.3*   63.9*
Threshold 0.33    43.8    56.9*   60.2*   60.1

* Accuracy statistically significant from the previous model.

Table 4
Testing accuracy values (average of 20 trials) across 4 models for hard cases.

                  M_1     M_2     M_3     M_4
Threshold 0       35.4    45.6*   46.3    43.2*
Threshold 0.25    36.4    41.4*   42.1    45.3*
Threshold 0.33    34.3    33.6    43.4*   47.3*

* Accuracy statistically significant from the previous model.

The marginal accuracy benefit is of particular interest to us, because we are looking for the most strategic allocation of our labels. Note that the marginal accuracy gain varies with the particular model and threshold value. Some of the trends are immediately intuitive – for example, easy cases tend to gain accuracy quickly (going from M_1 to M_2) in Table 3. The hard cases gain less accuracy at the same stage (practically none at the threshold of 0.33, and less at the thresholds of 0.25 and 0.0, as seen in Table 4).


Table 4 also indicates that if we consider acquiring more than 2 labels per hard case, then at least M_4, or possibly more than 4 labels, is needed. For thresholds of 0.0 and 0.25, there is no noticeable improvement in going from M_2 to M_3, and going from M_3 to M_4 actually causes a decrease in observed accuracy for the threshold value of 0.0.

In addition to applying the threshold-based data partition, the original data set was also split into two equal subsets of 405 cases chosen at random; the purpose of that split was to perform the same cost-cutting actions randomly and compare the results against the classification accuracy that relies on our definition of easy and hard cases. The fifty-fifty split was chosen because, without a mechanism to identify easy versus hard cases, the case distribution is not known. The 405/405 split enables us to randomly reduce annotation cost by removing some of the labels from the pool.

Fig. 5 summarizes the confidence intervals of the testing-set malignancy accuracy for the 9 different experiments completed: all cases (810), threshold-based easy cases (666, 556 and 285 cases), threshold-based hard cases (144, 254 and 525 cases), and two random splits of the data (each containing 405 cases). Four different models were created in each experiment, and each model corresponds to the number of radiologists rating a nodule image. The results show that our selection of easy cases for all thresholds has a higher accuracy and a tighter confidence interval curve than a random selection of cases or the subset with hard cases. Further, the easy cases at the threshold of 0.25 have a higher accuracy than those at the threshold of 0.33, and the easy cases at the threshold of 0.0 have an even higher accuracy with tight bounds. The results are as expected – as we continue to decrease the threshold value, the set of easy cases shrinks, keeping only the "easiest" of the easy. Likewise, the hard cases follow a different curve than the random selection (hard cases are more difficult to classify than either easy ones or random ones). For the threshold of 0.33 (most easy and fewest hard cases), the accuracy is very low and M_2 is unable to improve the accuracy of hard cases. For all models (except M_1), dealing with easy cases results in a slightly higher accuracy, while hard cases result in a much lower accuracy. The disparity is accounted for by the fact that, for that particular threshold (0.33), there are significantly more easy cases (82% versus 18%). The threshold of 0.25 does somewhat better, albeit with wider margins. Finally, the threshold of 0.0 for hard cases does even better, with tighter margins than 0.25. These trends matched our expectations, as we started with the 144 hardest cases at the threshold of 0.33, and by the time we reach the threshold of 0.0, 525 out of 810 cases are


considered to be hard cases. The average accuracy improves as the really hard cases are diluted by more borderline ones.

The two random half-subsets (RA and RB), each with 405 images, produce a distribution of accuracies that is very similar to "All Cases" with the entire dataset of 810 images. The subsets that were identified as easy and hard tell a different story: for each of our models (with the exception of M_1 for easy cases), the accuracies diverge. Easy cases result in a better overall accuracy, while hard cases produce a lower accuracy. As we theorized earlier, fewer ratings are necessary to achieve good accuracy when dealing with an easy case. Moreover, easy cases can achieve a higher classification accuracy than the overall data set, whereas hard cases hit their accuracy ceiling at a much lower accuracy value.

4.2. Cost versus accuracy results

We further compared the cost efficiency of the split based on our definition of easy/hard cases to a random split of the same cases. Table 5 illustrates some of the accuracy/cost points for the random split and for a threshold value of 0.33 (the analysis for other threshold values is similar). The first column in Table 5 refers to a particular combination of cases using the specified number of radiologist ratings for each subset. The threshold-based split defines an easy and a hard case subset, while the random split is simply a random partitioning into two equal subsets. For example, E2_H1 refers to using two radiologist ratings for the easy subset and one rating for the hard subset; similarly, the random split RA2_RB1 refers to two radiologist ratings for the Random 405 Half A set and one radiologist rating for the Random 405 Half B set. Note that this labeling is consistent with the results discussed with Fig. 5: the Min and Max points are equivalent to the "All Cases" testing results from Fig. 5, with Min referring to one rating and Max to four ratings. The split threshold is irrelevant if both subsets receive the same number of annotations (e.g., E4_H4 is equivalent to four ratings for every existing instance, no matter what metric we use for splitting).

The cost rating in Table 5 refers to the combined number of radiologist ratings used to build the classifier. The Min cost is equivalent to a single radiologist rating for each of the 810 cases (810 × 1 = 810), and the Max cost corresponds to all four radiologists rating all of the 810 cases (810 × 4 = 3240). The accuracy was estimated using extrapolation with the marginal accuracy estimates per category (using the experiments that measured accuracies for M_1 through M_4 for the easy and hard subsets individually). We have verified several of the points by rerunning the experiments and confirming that our extrapolated values are accurate within 0.1–0.2%. Additional combinations (such as E1_H4) are possible, but they did not provide any benefit and are not part of the curve in Fig. 6. Our results shown in Fig. 6 demonstrate that our definition of hard and easy has a superior cost effectiveness compared to a

Fig. 5. Confidence interval results for accuracy percentage for all data sets (all cases, threshold-based partitions, random partitions).

Table 5
Example annotation combination strategies and their corresponding costs and accuracies.

Combination   Cost   Accuracy   Data set
Min           810    0.44       Threshold 0.33
Min           810    0.44       Random split
RA2_RB1       1215   0.49       Random split
E2_H1         1476   0.55       Threshold 0.33
E2_H2         1620   0.53       Threshold 0.33
RA2_RB2       1620   0.53       Random split
E3_H4         2574   0.60       Threshold 0.33
RA3_RB4       2835   0.55       Random split
Max           3240   0.57       Threshold 0.33
Max           3240   0.57       Random split
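As a rough illustration of how such cost/accuracy points can be derived, the sketch below computes the annotation cost exactly (ratings × cases per subset) and approximates the accuracy as a case-weighted average of the per-category testing accuracies from Tables 3 and 4; the weighted-average form is our simplification of the extrapolation described above, so its accuracy values will not exactly match Table 5.

```python
def combination_cost_accuracy(n_easy, n_hard, r_easy, r_hard, acc_easy, acc_hard):
    """Cost = total number of radiologist ratings used; accuracy is approximated
    as a case-weighted average of the per-category model accuracies (a
    simplification of the extrapolation used in the paper)."""
    cost = n_easy * r_easy + n_hard * r_hard
    accuracy = (n_easy * acc_easy[r_easy - 1] + n_hard * acc_hard[r_hard - 1]) / (n_easy + n_hard)
    return cost, accuracy

# Threshold 0.33 subsets (666 easy, 144 hard) with M1..M4 testing accuracies (%)
# taken from Tables 3 and 4.
acc_easy = [43.8, 56.9, 60.2, 60.1]
acc_hard = [34.3, 33.6, 43.4, 47.3]
print(combination_cost_accuracy(666, 144, 2, 1, acc_easy, acc_hard))  # E2_H1: cost 1476
print(combination_cost_accuracy(666, 144, 3, 4, acc_easy, acc_hard))  # E3_H4: cost 2574
```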



Fig. 6. Annotation cost to accuracy trade-off for different threshold values.

Table 6
Example annotation combination strategies.

                                 Tree parameters             Training                   Testing
Feature    Parameter   Min cases     Min cases    Easy    Hard    Overall    Easy    Hard    Overall
set        set index   per parent    per leaf
Balanced
42         4           24            12           60.2    60.6    60.3       60.1    61.3    61.0
21         4           24            12           60.6    61.0    60.6       61.5    62.4    61.8
8          1           18            9            61.2    61.4    61.2       62.5    62.7    62.4
Unbalanced
42         6           28            14           51.0    72.1    65.1       52.0    72.5    65.9
21         6           28            14           50.0    71.7    64.6       51.6    71.9    65.6
8          4           24            12           52.7    72.5    66.2       53.1    72.6    66.5

random selection of cases. By successfully identifying easy cases, we are able to reduce the number of expert opinions needed to classify a case. In this specific scenario, the efficiency of our case selection is accentuated at the higher cost levels. For example, at the higher cost levels between 2000 and 3240 (the maximum cost), the accuracy of the random-split selection is flat even after adding additional annotations (i.e., approximately 1000 annotations can be skipped without a noticeable loss in accuracy). Two of our threshold-based approaches show significant accuracy improvements as more cost is added to the model: threshold 0.33 shows a 3.7% accuracy improvement from a cost of 2000 to 2600, and threshold 0 shows a 1.5% improvement from a cost of 2000 to 2700. Even our third threshold provides some benefit from a cost of 2000 to 2400, whereas the random-split selection shows 0% improvement from a cost of 2000 to 2800. We note that each threshold generates some points that do better than the other thresholds at different annotation cost points. This further reiterates our argument that no general rule can be applied, and that when allocating annotators one has to consider several different factors in choosing the best answer.

5. Difficulty-prediction results

Having presented a resource allocation strategy that relies on threshold-based case partitioning, we now evaluate the relationship between the image features described in Section 3.2 and the corresponding difficulty category of the image. We report the results for a threshold value of zero (285 easy cases and 525 hard cases) and will discuss the issue with borderline cases. Note that low variance values (intuitively, "borderline" cases) will cause the bulk of the

The balanced distribution selection maintains an equal

number of data points by randomly selecting 285 of the 525 hard cases, ensuring that easy and hard cases are represented equally during training and testing. In contrast, the unbalanced approach uses all 810 cases to train and test the classifier.

In order to determine the best classification trees for analysis, several candidate parameter sets consisting of the minimum cases per leaf and minimum cases per parent were tested. For each candidate parameter set, 100 trials were conducted. Each trial consisted of a random 80/20% split of the data, with 80% used for training with 10-fold cross-validation and 20% set aside for testing. The tree with the best cross-validation accuracy in each trial was used for testing in that trial. The union of the most important features from the two data distributions (balanced and unbalanced) resulted in a set of 21 features that included 8 uncorrelated features. Three classification models were then created on all features (42), the most important features (21), and the most important uncorrelated features (8). Table 6 shows the results for the combination of tree parameters that resulted in the highest average accuracy for the balanced and unbalanced datasets and for each feature set.

Table 7 shows the predictor importance for the trees produced with the balanced and unbalanced sampling procedures. In both cases, the importance of area, contrast, and entropy is an order of magnitude larger than the importance of the other features, indicating that these features play a large role in distinguishing between easy and hard cases.

While the above analysis provides feature importance, we are also interested in understanding how these features (1) are combined to produce classification rules for easy versus hard, and (2) group malignant cases under each level of diagnosis complexity. To accomplish this, we identify any leaf nodes with high purity (indicating good discrimination between easy and hard) and high probability (indicating that the combination of


features and values associated with this discrimination occurred often in the dataset). We anticipated that the features and values associated with the path from the base of the tree to these nodes may suggest simple rules for determining easy from hard for a significant subset of the data. We conducted this analysis for classification models produced using the balanced and unbalanced sampling procedure, but we found that the two models were substantially the same regarding the characteristics of high purity and high probability leaf nodes. Therefore in Tables 8 and 9 we report examples of these leaf nodes produced with balanced sampling only. Leaf node statistics reported in the table are based on the training sample. Malignancy rating statistics and classification agreement with threshold criteria are based on using the model to evaluate all 810 cases. Some general rules are evident from the easy leaf nodes reported in Table 8. First, leaf node number 6 indicates that small

Table 7
Predictor importance for balanced and unbalanced sampling procedures.

Variable          Balanced    Unbalanced
Area              2.11E-03    1.53E-03
Contrast          1.77E-03    2.13E-03
Entropy           1.73E-03    1.67E-03
Homogeneity       7.45E-04    0
MaxIntensityBG    6.83E-04    5.73E-04
Elongation        9.47E-04    8.52E-04
GaborSD_All       5.95E-04    3.72E-04
Markov2           0           3.90E-04

Some general rules are evident from the easy leaf nodes reported in Table 8. First, leaf node 6 indicates that small nodules with high contrast are classified as easy and have very low malignancy ratings. The model selects these cases based on the segmented nodule containing fewer than 168 pixels (area below 167.5) and the contrast being high (above 0.125). Yet the rules are not always this simple, as indicated by leaf node 33: nodules with less contrast are also considered easy and probably benign as long as other criteria are satisfied, such as being very small and having low entropy (a measure of randomness). Finally, leaf node 18 in Table 8 indicates that not all easy nodules are benign: nodules with large area and high entropy can be classified as easy but still have a high probability of malignancy (an average mode of the malignancy ratings of 4.65 out of 5).

Table 9 shows results for leaf nodes that classify hard cases. All three leaf nodes reported in Table 9 use low contrast to identify hard cases, testifying to the importance of this criterion for determining hard cases. Leaf nodes 17 and 25 indicate that nodules with low contrast, large area, and low entropy are hard to assess but are more likely to be rated malignant than benign (average malignancy rating modes of 3.69 and 3.59, respectively); these two leaf nodes differ only in how they use the homogeneity feature. Node 13, on the other hand, shows that small nodules with indeterminate malignancy are hard to assess if their contrast is low; here the average mode of the malignancy ratings is 2.73 out of 5.
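To make these rules concrete, they can be written out as plain predicates; the numeric thresholds below are copied from Tables 8 and 9, while the dictionary keys are only an assumed way of naming the features of a single nodule record.

```python
# Illustrative predicates for the example leaf rules in Tables 8 and 9.
def easy_small_high_contrast(c):   # Table 8, leaf node 6
    return c["area"] < 167.5 and c["contrast"] > 0.125

def easy_small_low_entropy(c):     # Table 8, leaf node 33
    return (c["contrast"] < 0.075 and c["area"] < 131
            and c["entropy"] < 0.555 and c["elongation"] > 1.4)

def hard_large_low_contrast(c):    # Table 9, leaf node 17
    return (c["contrast"] < 0.125 and c["area"] > 131
            and c["entropy"] < 0.755 and c["homogeneity"] > 0.255)

# Hypothetical nodule record, not a case from the dataset:
case = {"area": 120.0, "contrast": 0.2, "entropy": 0.4,
        "elongation": 1.1, "homogeneity": 0.3}
print(easy_small_high_contrast(case))   # True for this hypothetical nodule
```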

Table 8
Examples of easy leaf nodes: examples selected from leaf nodes with high purity and probability from the classification tree model trained with balanced sampling.

Leaf node 6
  Leaf node statistics: Easy: 42, Hard: 7, Purity: 0.86, Prob: 0.12
  Features: Contrast > 0.125, Area < 167.5
  Malignancy rating statistics: Ave Mode: 1.28, Ave Var.: 0.55
  Classification agreement with threshold criteria: 63 out of 78 cases
  Example nodule: ID 360

Leaf node 33
  Leaf node statistics: Easy: 10, Hard: 0, Purity: 1, Prob: 0.02
  Features: Contrast < 0.075, Area < 131, Entropy < 0.555, Elongation > 1.4
  Malignancy rating statistics: Ave Mode: 2.71, Ave Var.: 0.49
  Classification agreement with threshold criteria: 15 out of 24 cases
  Example nodule: ID 1830

Leaf node 18
  Leaf node statistics: Easy: 18, Hard: 3, Purity: 0.86, Prob: 0.05
  Features: Contrast < 0.125, Area > 131, Entropy > 0.755, Gabor_SD < 0.495
  Malignancy rating statistics: Ave Mode: 4.65, Ave Var.: 1.12
  Classification agreement with threshold criteria: 29 out of 49 cases
  Example nodule: ID 4

Table 9
Examples of hard leaf nodes: examples selected from leaf nodes with high purity and probability from the classification tree model trained with balanced sampling.

Leaf node 17
  Node statistics: Easy: 8, Hard: 59, Purity: 0.88, Prob: 0.16
  Features: Contrast < 0.125, Area > 131, Entropy < 0.755, Homogeneity > 0.255
  Malignancy rating statistics: Ave Mode: 3.69, Ave Var.: 1.27
  Classification agreement with threshold criteria: 141 out of 150 cases
  Example nodule: ID 425

Leaf node 25
  Node statistics: Easy: 1, Hard: 10, Purity: 0.91, Prob: 0.03
  Features: Contrast < 0.125, Area > 309, Entropy < 0.755, Homogeneity < 0.255
  Malignancy rating statistics: Ave Mode: 3.59, Ave Var.: 1.33
  Classification agreement with threshold criteria: 27 out of 29 cases
  Example nodule: ID 5132

Leaf node 13
  Node statistics: Easy: 4, Hard: 19, Purity: 0.83, Prob: 0.06
  Features: Contrast < 0.125, Area < 131, Elongation < 1.165, MaxIntBG > 0.465
  Malignancy rating statistics: Ave Mode: 2.73, Ave Var.: 0.79
  Classification agreement with threshold criteria: 28 out of 37 cases
  Example nodule: ID 1483


Table 10
Confusion matrices, 42 features; the percentages show the distribution of the hard cases with respect to the different values for the variance threshold V. Actual classes in rows, predicted classes in columns.

                  Unbalanced            Balanced
                  Easy      Hard        Easy      Hard
Easy              185       80          231       186
Hard              99        445         53        339
  V = 0.25        20%       80%         37%       63%
  V = 0.33        8%        92%         33%       67%
  V = 0.67        6%        94%         26%       74%

The accuracy of predicting difficulty in Table 6 may appear low; however, we must consider what we are predicting in this scenario. When generating our input dataset, we established a threshold separating easy and hard cases; here the threshold is set to zero, placing all cases with a variance of 0.25 and higher into the hard category. A case with a variance of 0.68 (or higher) is, however, a "harder" hard case than one with a variance of 0.25 or 0.33. No matter where we set the threshold, cases with a variance of 0.25-0.33 will remain borderline and are more likely to be confused with cases from the other category; 381 cases have the potential to transition between easy and hard, depending on the threshold setting. Table 10 breaks down the confusion matrices for the 42-feature decision trees by these variance categories.
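The sketch below shows one way the Table 10 percentages could be computed, under our reading that each V row covers the hard cases whose rating variance falls at that level (0.25-0.33, 0.33-0.67, and 0.67 and above); the array names, the banding, and the synthetic example values are assumptions, not the study data.

```python
# Hypothetical computation of the per-variance-level breakdown in Table 10.
import numpy as np

def variance_breakdown(variance, predicted_hard, levels=(0.25, 0.33, 0.67)):
    """variance: per-case rating variance (cases with variance >= 0.25 form
    the hard category); predicted_hard: boolean model predictions."""
    for i, v in enumerate(levels):
        upper = levels[i + 1] if i + 1 < len(levels) else np.inf
        band = (variance >= v) & (variance < upper)   # hard cases at level V
        if not band.any():
            continue
        pct_hard = 100.0 * predicted_hard[band].mean()
        print(f"V = {v}: {100 - pct_hard:.0f}% predicted easy, "
              f"{pct_hard:.0f}% predicted hard")

# Example with synthetic values only:
rng = np.random.default_rng(0)
var = rng.uniform(0.25, 1.2, size=525)
pred = rng.random(525) > 0.2
variance_breakdown(var, pred)
```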

6. Conclusion

The results reveal that the marginal benefit gained from additional annotators, for problems without certain labels, depends on the difficulty of the instance classification problem, leading to the conclusion that the ability to separate easy from hard-to-classify cases can lead to better use of annotation resources. In our particular case, we were able to reduce the overall annotation cost by 20% without any loss of accuracy for our decision tree predictor. Even for a constrained budget (less than half of all annotations, approximately 45%), we were able to limit the accuracy loss better than random shedding of annotations. Furthermore, the use of more than three annotators to label easy lung nodules appears to provide little benefit, while for hard-to-classify nodules additional annotators gradually increased the classification accuracy. No single rule exists, however; marginal accuracy gains depend both on the instance difficulty and on the threshold value used. Our results on the LIDC dataset also show that with just a few low-level image features (18% of the original set) we are able to separate the easy from the hard cases for a significant subset (66%) of the lung nodule image data. This will allow automated partitioning of incoming new cases and can provide helpful feedback to radiologists based on image features. As far as overall accuracy is concerned, our technique of using simple decision trees is an initial step toward understanding the problem. It would be informative to see how the results generalize to other classification approaches beyond decision trees and how more sophisticated techniques, such as ensemble methods, would improve the overall accuracy of the classifier.

Furthermore, it will be interesting to explore how the uncertainty in the interpretation of other semantic characteristics (such as texture, subtlety, spiculation, and lobulation) impacts the prediction models for both malignancy and level of diagnostic complexity. Another extension of this work will be the inclusion of image segmentation uncertainty in the prediction models, given the existing variability in the boundaries delineated by the radiologists. Finally, it will also be interesting to investigate extending the number of annotators as well as finer gradations of the easy/hard threshold (e.g., easy, medium, hard).

Conflict of interest statement

None declared.

