Computerized Medical Imaging and Graphics 42 (2015) 16–24


Cell words: Modelling the visual appearance of cells in histopathology images

Korsuk Sirinukunwattana a,b,1, Adnan M. Khan a,b,1, Nasir M. Rajpoot a,b,∗

a College of Engineering, Qatar University, Doha, Qatar
b Department of Computer Science, University of Warwick, UK
∗ Corresponding author. Tel.: +44 (24) 7698 0999. E-mail addresses: [email protected] (K. Sirinukunwattana), [email protected] (A.M. Khan), [email protected] (N.M. Rajpoot).
1 Joint first authors.

Article history: Received 4 April 2014; Received in revised form 29 May 2014; Accepted 10 November 2014

Keywords: Dictionary learning; Visual appearance modelling of cells; Mitotic cell detection; Histopathology image analysis

Abstract

Detection and classification of cells in histological images is a challenging task because of the large intra-class variation in the visual appearance of various types of biological cells. In this paper, we propose a discriminative dictionary learning paradigm, termed Cell Words, for modelling the visual appearance of cells, which includes colour, shape, texture and context in a unified manner. The proposed framework is capable of distinguishing mitotic cells from non-mitotic cells (apoptotic, necrotic, epithelial) in breast histology images with high accuracy.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Detection of cells in histopathology images is a subject of interest in a wide range of cell-based studies, as it is critical for evaluating the existence of disease and its severity. For example, infiltration of lymphocytes in breast cancer has been shown to be related to patient survival [1]. Similarly, nuclear pleomorphism has diagnostic value for cancer grading [2,3], and mitotic count is an important prognostic parameter in breast cancer grading [4].

Due to the diverse variation in the visual appearance of cells, automated systems often require methods that adapt well to different kinds of cells. The difficulty of the problem increases significantly when the cell density of the tissue sample is high, resulting in cell overlap and clumping. Moreover, in some applications, the visual appearance of cell types is quite similar to that of other structures present in the same image, posing a great challenge which is often very hard to address with classical image processing techniques.

Traditional approaches to cell detection can be broadly divided into five main categories [5]: intensity based, region based, active contours/level sets, probabilistic, and graphical models.


(1) Intensity-based methods (e.g. thresholding) search for an optimal intensity value such that the intensities belonging to cells are separated from those belonging to the background; (2) region-based methods (e.g. region growing) also normally operate on intensity, but follow the basic principle of merging two neighbouring regions if they satisfy some predefined merging criterion; (3) active contour models are deformable splines that can be used to delineate the contours of nuclei in an image using gradient information; (4) probabilistic models (e.g. Gaussian mixture models) represent cells in histopathology images as a weighted sum of several Gaussian densities or as a mixture of Gaussian and other (e.g. Gamma) densities [6], with the parameters of these models usually estimated from training data using parameter estimation techniques such as expectation maximisation [7]; (5) graphical models (e.g. graph cuts) conceptualise images as graphs, where each pixel is represented by a node and the relationship between neighbouring pixels is represented by edges; these methods partition the graph into disjoint subgraphs such that similarity is high within subgraphs and low across different subgraphs.

In this paper, we propose a model for the visual appearance of cells in histopathology images. Rather than using computationally intensive models for cell detection (such as active contours and graphical models), we propose a discriminative dictionary learning (DDL) based paradigm to model the visual appearance of cells that intrinsically takes into account various features including colour, shape, texture, and context of cells. The proposed model aims at learning a dictionary (with class-specific atoms) that simultaneously has both good reconstruction and high discrimination ability. Moreover, it exploits some of the attractive properties of sparse methods (e.g. robustness to noise and succinct representation of the data) in order to learn discriminative representations of complex biological objects.


Fig. 1. Visual appearance of different cells in breast histopathological images. The first four images (from left) are mitotic cells and the last four are non-mitotic cells.

The proposed model is evaluated on the publicly available MITOS dataset [8] for the detection of mitotic cells (MCs) in breast histopathology images. Detection and quantification of MCs in histopathology images is an important diagnostic indicator for many cancers, including breast cancer. It is a challenging task because of the large intra-class variation in the visual appearance of MCs (see Fig. 1). Additionally, if standard haematoxylin and eosin (H&E) staining is used (which stains chromatin-rich structures, such as nuclei, apoptotic cells and MCs, in dark blue), it becomes extremely challenging to detect the MCs, as shown in Fig. 1.

The majority of previously proposed approaches for MC detection work by first identifying candidate objects or locations that are then accepted (or rejected) as MCs based on some similarity criterion [9]. The candidate extraction phase often makes use of the colour distinctiveness of MCs by building statistical models [6] or by performing thresholding followed by refinement of the detected regions using morphological operations and/or active contour segmentation [10,11]. In the second stage, more specialised features, ranging from morphological, geometrical and textural features to features learnt by deep convolutional neural networks, are used to train a classification framework [6,10–13]. The approach proposed in [6] is unique in the sense that it uses tumour segmentation to reduce the search space of mitotic cells [14,15,41].

Our main contributions are as follows: (1) we present a framework for modelling the visual patterns in histopathology images and apply this model to discriminate between different kinds of cells; (2) we present a formulation for DDL that is computationally efficient and yields high accuracy; (3) we present a systematic method for selecting the optimal values of the algorithm's parameters; (4) we demonstrate the effectiveness of the proposed model on the publicly available MITOS dataset [8] for mitotic cell detection.

2. Related work

Recently, several DDL algorithms have been proposed in the literature. Discriminative K-SVD [16] uses a linear regression term in a dictionary learning objective function which penalises non-discriminative atoms; a by-product of the learning process is a linear classifier that can be directly applied to the learnt sparse code. Label consistent K-SVD [17] adds a label consistency term to the objective function of the discriminative K-SVD method, which encourages the use of atoms with the same label to reconstruct a data point. Structured incoherent dictionary learning [18] integrates a term into the objective function that minimises the covariance between atoms of different classes, so as to avoid overlap between atoms from different classes. Fisher DDL [19], the work closest to the proposed model, employs the Fisher discriminant criterion to minimise within-class variation and maximise between-class variation of the sparse codes. However, the Fisher term yields an insignificant improvement in the classification accuracy of the model and is computationally expensive to compute, which makes the model inefficient for histological images.

The success of these DDL methods is mainly attributed to the following two characteristics of sparse models: (1) sparsity, a mechanism for regularising the coefficients that represent the data; and (2) the dictionary, which is learnt directly from the data.

These characteristics are controlled by two critical parameters: the desired sparsity of the coefficients, and the size (in terms of the number of atoms) of the dictionary. However, lacking theoretical guidelines to select these parameters, most DDL methods resort to hand-tuning and cross-validation, which is often time consuming and ineffective. In this study, we use a more principled approach to select both parameters.

3. The proposed model

Fig. 2 provides an overview of the proposed model, which consists of two phases: (1) dictionary learning; (2) classification. During the dictionary learning phase, a dictionary is learnt from the training data by solving the optimisation problem outlined in Section 3.1. During classification, a test sample is first encoded over the learnt dictionary (i.e. represented as a linear combination of a few atoms from the learnt dictionary), and then the reconstruction residual based on the code is used to determine the class of the test sample (Section 3.2). Note that, in order to keep the formulation simple, the proposed model uses a two-class formulation; however, it can easily be extended to more than two classes.

3.1. Dictionary learning

Let $c \in \{C1, C2\}$ denote a class index, where $C1$ and $C2$ represent two different classes of cells. Let $X = [X_{C1}, X_{C2}] \in \mathbb{R}^{m \times N}$, with $X_c = [x_{c,1}, \ldots, x_{c,n_c}] \in \mathbb{R}^{m \times n_c}$ being a matrix whose columns are patches from class $c$, $m$ the number of pixels in a patch, $N$ the total number of training samples, and $n_c$ the number of training samples of class $c$. Let $D = [D_{C1}, D_{C2}] \in \mathbb{R}^{m \times K}$, with $D_c = [d_{c,1}, \ldots, d_{c,k_c}] \in \mathbb{R}^{m \times k_c}$ being the dictionary for class $c$ containing $k_c$ atoms, and $K$ the total number of atoms. Let $A = [A_{C1}, A_{C2}] \in \mathbb{R}^{K \times N}$, with $A_c = [\alpha_{c,1}, \ldots, \alpha_{c,n_c}] \in \mathbb{R}^{K \times n_c}$ being the sparse code matrix corresponding to $X_c$. Our task is to solve the following optimisation problem:

$$\min_{D,A} \; \sum_{c} \Big( r(X_c, D, A_c) + \lambda \|A_c\|_1 \Big) \quad (1)$$

subject to $d_{c,j}^T d_{c,j} = 1$, $\forall j = 1, \ldots, k_c$, where

$$r(X_c, D, A_c) = \|X_c - D A_c\|_F^2 + \|X_c - D_c A_c^{(c)}\|_F^2 + \gamma \sum_{l \neq c} \|D_l A_c^{(l)}\|_F^2, \quad (2)$$

$\|\cdot\|_1$ and $\|\cdot\|_F$ denote the $l_1$ norm and the Frobenius norm of a matrix, respectively, $A_c^{(l)}$ denotes the sub-matrix of $A_c$ that corresponds to $D_l$, and $\lambda$ is the sparsity regularisation parameter. In Eq. (2), the first term is the reconstruction error of $X_c$ using the whole dictionary, the second term encourages the use of $D_c$ for reconstructing $X_c$, while the last term discourages the reconstruction of $X_c$ using dictionary atoms from other classes, penalised by a parameter $\gamma$. Note that the formulation above is simpler and more efficient than the one proposed in [19], as we do not calculate any Fisher term, which is computationally expensive and yields an insignificant improvement.
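To make the objective in (1)-(2) concrete, the following minimal NumPy sketch evaluates the discriminative fidelity term $r(X_c, D, A_c)$ plus the $l_1$ penalty for a single class. It is our illustration rather than the authors' code; the function name, the atom_class bookkeeping and the default parameter values are assumptions.

```python
import numpy as np

def cell_words_cost(Xc, D, Ac, atom_class, c, lam=0.1, gamma=5.0):
    """Evaluate r(Xc, D, Ac) + lam * ||Ac||_1 for one class c (Eqs. (1)-(2)).

    Xc         : (m, n_c) patches of class c, one vectorised patch per column
    D          : (m, K) full dictionary, one atom per column
    Ac         : (K, n_c) sparse codes of the class-c patches over D
    atom_class : (K,) class label of each atom
    """
    # reconstruction of Xc by the whole dictionary
    whole = np.linalg.norm(Xc - D @ Ac, 'fro') ** 2
    # reconstruction of Xc by its own sub-dictionary D_c and codes A_c^(c)
    own = atom_class == c
    own_term = np.linalg.norm(Xc - D[:, own] @ Ac[own, :], 'fro') ** 2
    # penalise the energy that atoms of the other classes contribute to Xc
    cross_term = 0.0
    for l in np.unique(atom_class):
        if l != c:
            other = atom_class == l
            cross_term += np.linalg.norm(D[:, other] @ Ac[other, :], 'fro') ** 2
    return whole + own_term + gamma * cross_term + lam * np.abs(Ac).sum()
```

Summing this cost over both classes gives the objective of Eq. (1); the unit-norm constraint on the atoms is enforced separately during the dictionary update.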


Fig. 2. The proposed paradigm for learning the visual appearance of cells.

The objective function (1) is not jointly convex in $A$ and $D$, but it is convex in $A$ if $D$ is fixed and vice versa. To optimise (1), we follow the traditional way of solving dictionary-based classification problems and decouple (1) into two convex optimisation problems, namely sparse code optimisation and dictionary optimisation. The optimisation procedure alternates between the two sub-problems until the convergence criterion is reached.

3.1.1. Sparse code optimisation

Given class $c$, we fix $D$ and $A_l$, $l \neq c$. Thus, the optimisation problem (1) becomes a convex optimisation problem:

$$\min_{A_c} \; r(X_c, D, A_c) + \lambda \|A_c\|_1, \quad (3)$$

also known as the lasso [20]. A number of efficient algorithms to solve (3) have been proposed, such as the interior point method [21], the coordinate descent method [22], the iterative thresholding method [23], and the LARS/homotopy method [24,25]. Here, we use the iterative projection method [26] in combination with the two-step iterative shrinkage/thresholding (TwIST) method [27]. At each iteration, the iterative projection method optimises (3) by considering a fixed-point equation:

$$A_c^{s+1} = \Phi(A_c^s) \equiv S_{\lambda/2\sigma}\left( A_c^s - \frac{1}{2\sigma} \nabla r(X_c, D, A_c^s) \right), \quad (4)$$

where $A_c^s$ denotes the sparse code matrix at the $s$th iteration, $\nabla r(X_c, D, A_c^s)$ denotes the gradient of $r$ with respect to $A_c^s$, and $S_{\lambda/2\sigma}(\cdot)$ is a soft-thresholding function [28], defined element-wise on a matrix $A$ as follows:

$$S_{\lambda/2\sigma}(A)_{ij} = \mathrm{sign}(\alpha_{ij}) \max\left( 0, |\alpha_{ij}| - \frac{\lambda}{2\sigma} \right), \quad (5)$$

where $\alpha_{ij}$ is the $(i, j)$ element of $A$. Essentially, the method employs a gradient descent technique to search for the optimal solution. The search is done in the steepest descent direction $-\nabla r$ with step size $\frac{1}{2\sigma}$. The new search point is then regularised to conform with the $l_1$ norm regularisation using the soft-thresholding function. Since $r$ is strictly convex and differentiable, the map (4) is guaranteed to converge to a unique fixed point which minimises (3), given an appropriate value of $\sigma$ (Theorem 1 in [26]). The optimal prior choice of the parameter $\sigma$ that makes the map (4) contract is $\sigma = (a + b)/4$, where $a$ and $b$ are the smallest and the largest eigenvalues of the Hessian of $r$.

The TwIST method aims at improving the convergence rate of the map (4), especially when (3) is ill-conditioned. The method considers the map:

$$A_c^{s+1} = (1 - \delta) A_c^{s-1} + (\delta - \beta) A_c^s + \beta S_{\lambda/2\sigma}\left( A_c^s - \frac{1}{2\sigma} \nabla r(X_c, D, A_c^s) \right), \quad (6)$$

where $\delta, \beta > 0$. This equation is in fact equivalent to (4); however, it also incorporates information from the past and the present in order to calculate the next point. The parameters $\delta$ and $\beta$ play important roles in the convergence and the convergence rate of the algorithm. The algorithm converges to a unique fixed point which is a minimiser of (3) when

$$\delta = \bar{\rho}^2 + 1, \quad (7)$$

$$\beta = \frac{2\delta}{\tilde{b} + a}, \quad (8)$$

where $\tilde{b} = \max\{1, b\}$, $\xi = a/\tilde{b}$, and $\bar{\rho} = (1 - \sqrt{\xi})/(1 + \sqrt{\xi})$ (Theorem 4, [27]).
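The updates in (4)-(8) can be sketched as follows. This is our illustration under the equations above, not the authors' implementation; grad_r is a caller-supplied function returning $\nabla r$ for the current code matrix.

```python
import numpy as np

def soft_threshold(A, tau):
    """Element-wise soft-thresholding S_tau(A), Eq. (5)."""
    return np.sign(A) * np.maximum(0.0, np.abs(A) - tau)

def phi(A, grad_r, lam, sigma):
    """One iterative-projection step, Eq. (4): gradient step on r, then shrink."""
    return soft_threshold(A - grad_r(A) / (2.0 * sigma), lam / (2.0 * sigma))

def twist_step(A_prev, A_curr, grad_r, lam, sigma, delta, beta):
    """One TwIST step, Eq. (6): weighted combination of past, present and Phi."""
    return ((1.0 - delta) * A_prev + (delta - beta) * A_curr
            + beta * phi(A_curr, grad_r, lam, sigma))

def twist_parameters(a, b):
    """delta and beta of Eqs. (7)-(8), given extreme eigenvalues a <= b of the Hessian of r."""
    b_t = max(1.0, b)
    xi = a / b_t
    rho = (1.0 - np.sqrt(xi)) / (1.0 + np.sqrt(xi))
    delta = rho ** 2 + 1.0
    return delta, 2.0 * delta / (b_t + a)
```

For a plain least-squares fidelity $\|x - D\alpha\|_2^2$, grad_r would be the map $\alpha \mapsto 2 D^T (D\alpha - x)$; for the discriminative fidelity of Eq. (2) it also carries the class-specific terms.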

3.1.2. Dictionary optimisation

To update the dictionary of class $c$, $D_c$, we fix $A$ as well as $D_l$, $l \neq c$. As a result, the optimisation problem (1) becomes

$$\min_{D_c} \; \|X_c - D A_c\|_F^2 + \|X_c - D_c A_c^{(c)}\|_F^2 + \gamma \sum_{l \neq c} \|D_c A_l^{(c)}\|_F^2 \quad (9)$$

subject to $d_{c,j}^T d_{c,j} = 1$, $\forall j = 1, \ldots, k_c$. We optimise (9) by updating the atoms $d_{c,j}$, $j = 1, \ldots, k_c$ in sequence using the block coordinate descent method [29,30], where for each atom $j$ a Lagrangian

$$L(d_{c,j}, \eta) = \|X_c - D A_c\|_F^2 + \|X_c - D_c A_c^{(c)}\|_F^2 + \gamma \sum_{l \neq c} \|D_c A_l^{(c)}\|_F^2 + \eta (1 - d_{c,j}^T d_{c,j}), \quad (10)$$

K. Sirinukunwattana et al. / Computerized Medical Imaging and Graphics 42 (2015) 16–24

with $\eta$ being a Lagrange multiplier, is minimised. Solving for the $d_{c,j}$ that minimises (10), we obtain

$$d_{c,j}^* = \frac{u}{\|u\|_2}, \quad (11)$$

where

$$u = (X_c - D A_c + d_{c,j} \hat{\alpha}_{c,j}) \hat{\alpha}_{c,j}^T + (X_c - D_c A_c^{(c)} + d_{c,j} \hat{\alpha}_{c,j}) \hat{\alpha}_{c,j}^T + \gamma \sum_{l \neq c} (D_c A_l^{(c)} + d_{c,j} \hat{\alpha}_{l,j}) \hat{\alpha}_{l,j}^T, \quad (12)$$

and $\hat{\alpha}_{l,j}$ denotes the $j$th row of $A_l$.
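A NumPy sketch of this sequential atom update is given below; it mirrors Eqs. (11)-(12) as printed above, is our illustration rather than the authors' code, and assumes the codes are stored so that row $j$ of each code matrix corresponds to atom $d_{c,j}$.

```python
import numpy as np

def update_atoms(Xc, D, A, atom_class, c, gamma=5.0):
    """Sequentially re-estimate the atoms of D_c (Eqs. (11)-(12)); returns a copy of D.

    Xc         : (m, n_c) patches of class c
    D          : (m, K) full dictionary
    A          : dict mapping class label -> (K, n_l) code matrix of that class's patches
    atom_class : (K,) class label of each atom
    """
    D = D.copy()
    own = np.where(atom_class == c)[0]           # global indices of the atoms of D_c
    for j in own:
        a_cj = A[c][j, :]                        # codes of class-c patches on atom j
        # residuals with the contribution of atom j added back (first two terms of Eq. (12))
        r_full = Xc - D @ A[c] + np.outer(D[:, j], a_cj)
        r_own = Xc - D[:, own] @ A[c][own, :] + np.outer(D[:, j], a_cj)
        u = r_full @ a_cj + r_own @ a_cj
        # cross-class term of Eq. (12)
        for l, Al in A.items():
            if l != c:
                a_lj = Al[j, :]                  # codes of class-l patches on atom j
                u += gamma * ((D[:, own] @ Al[own, :] + np.outer(D[:, j], a_lj)) @ a_lj)
        D[:, j] = u / np.linalg.norm(u)          # Eq. (11): renormalise to unit l2 norm
    return D
```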

3.1.3. Learning the cell words

The proposed method for learning the cell words can be summarised in the following four steps.

1. Dictionary initialisation: for each class $c$, perform PCA on $X_c$, where patches are treated as variables and pixels are treated as observations. Use the first $k_c - 1$ principal components and the mean of all patches in $X_c$ to initialise the dictionary $D_c^0$.
2. Sparse code initialisation: for each class $c$, initialise the sparse code $A_c^0$ by solving $A_c^0 = \arg\min_{A} \|X_c - D_c A\|_F^2$ subject to $\|\alpha_i\|_0 \leq L$, $\forall i = 1, \ldots, n_c$ (13), in which $A = [\alpha_1, \ldots, \alpha_{n_c}]$ and $L$ is the number of non-zero elements in $\alpha_i$, via the orthogonal matching pursuit method [31].
3. Dictionary optimisation: for each class $c$, find unused atoms in $D_c^t$ and replace them with randomly selected patches from the same class. Find the $D_c^{t+1}$ that optimises (9) using the block coordinate descent method described in Algorithm 1.
4. Sparse code optimisation: for each class $c$, find the $A_c^{t+1}$ that optimises (3) using Algorithm 2.

Steps (3) and (4) are repeated until a stopping criterion, such as a maximum number of iterations or a tolerance on the cost of the objective function, is reached.

Algorithm 1. Dictionary optimisation via the block coordinate descent method
Input: $X_c$ data from class $c$, $D^t$ an initial dictionary, $A_c^t$ a sparse code matrix of class $c$
Output: $D_c^{t+1}$ an optimised dictionary for class $c$
Define:
1. $A_c^t (A_c^{t(c)})^T = [a_1, \ldots, a_{k_c}] \in \mathbb{R}^{K \times k_c}$
2. $\tilde{\Sigma}_l = A_l^{t(c)} (A_l^{t(c)})^T = [\tilde{a}_{l,1}, \ldots, \tilde{a}_{l,k_c}] \in \mathbb{R}^{k_c \times k_c}$, where $l$ is a class index
3. $B = X_c (A_c^{t(c)})^T = [b_1, \ldots, b_{k_c}] \in \mathbb{R}^{m \times k_c}$
4. $D_c^t = [d_{c,1}, \ldots, d_{c,k_c}]$, the submatrix of $D^t$ corresponding to class $c$
for $j = 1$ to $k_c$ do
    $u \leftarrow (2 b_j - D^t a_j - D_c^t \tilde{a}_{c,j} + 2 d_{c,j} \tilde{\Sigma}_{c,jj}) + \gamma \sum_{l \neq c} (D_c^t \tilde{a}_{l,j} + d_{c,j} \tilde{\Sigma}_{l,jj})$
    $d_{c,j} \leftarrow u / \|u\|_2$
end for

Algorithm 2. Sparse code optimisation via the iterative projection method and the two-step iterative shrinkage/thresholding method
Input: $X_c$ data from class $c$, $D^{t+1}$ a dictionary, $A_c^t$ an initial sparse code for class $c$, $T$ the total number of iterations, $\lambda$ the sparsity regularisation parameter
Output: $A_c^{t+1}$ an optimised sparse code for class $c$
Define: a cost function $g(A) = r(X_c, D, A) + \lambda \|A\|_1$
$A^0 \leftarrow A_c^t$, $A^1 \leftarrow A_c^t$
for $s = 1$ to $T$ do
    $\tilde{A} \leftarrow (1 - \delta) A^{s-1} + (\delta - \beta) A^s + \beta \Phi(A^s)$
    if $g(\tilde{A}) \leq g(A^s)$ then
        $A^{s+1} \leftarrow \tilde{A}$
    else
        $A^{s+1} \leftarrow \Phi(A^s)$
    end if
end for
$A_c^{t+1} \leftarrow A^{T+1}$
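The four steps and the two algorithms fit together as in the following high-level sketch (our illustration, not the authors' code); init_dict, init_codes, update_dict, update_codes and cost are hypothetical helpers standing in for steps 1-4 and the evaluation of Eq. (1).

```python
def learn_cell_words(X, init_dict, init_codes, update_dict, update_codes, cost,
                     max_iter=100, tol=1e-6):
    """Alternating optimisation of Section 3.1.3: initialise (steps 1-2), then repeat
    steps 3-4 until the relative change in the objective of Eq. (1) falls below tol.

    X is a dict mapping class label -> (m, n_c) matrix of training patches.
    """
    D = {c: init_dict(Xc) for c, Xc in X.items()}        # step 1: PCA initialisation
    A = {c: init_codes(Xc, D) for c, Xc in X.items()}    # step 2: OMP initialisation
    costs = []
    for _ in range(max_iter):
        for c in X:
            D[c] = update_dict(X[c], D, A, c)            # step 3: Algorithm 1
        for c in X:
            A[c] = update_codes(X[c], D, A, c)           # step 4: Algorithm 2
        costs.append(sum(cost(X[c], D, A, c) for c in X))
        # relative-change stopping rule (cf. Section 4.3)
        if len(costs) >= 3 and \
           abs(costs[-1] - costs[-2]) < tol * abs(costs[-2] - costs[-3]):
            break
    return D, A
```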

3.2. Classification

We calculate a sparse code $\alpha^*$ of a test sample $x$ using the whole learnt dictionary $D$:

$$\alpha^* = \arg\min_{\alpha \in \mathbb{R}^K} \; \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1, \quad (14)$$

where $\lambda$ is a sparsity regularisation parameter. Then, we assign a class label to $x$ based on the reconstruction error of the different class-specific dictionaries as follows:

$$c^* = \arg\min_{c \in \{C1, C2\}} \|x - D_c \alpha^{*(c)}\|_2, \quad (15)$$

where $\alpha^{*(c)}$ denotes the entries of $\alpha^*$ that correspond to $D_c$.
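A minimal sketch of the decision rule in (14)-(15) is shown below. It is our illustration rather than the authors' code: the lasso encoding is delegated to scikit-learn's Lasso for brevity (whose objective scales the two terms slightly differently from Eq. (14)), and alpha is a placeholder for the classification regularisation parameter.

```python
import numpy as np
from sklearn.linear_model import Lasso

def classify_patch(x, D, atom_class, alpha=0.01):
    """Assign a class to patch x by the reconstruction rule of Eqs. (14)-(15).

    x          : (m,) vectorised test patch
    D          : (m, K) learnt dictionary
    atom_class : (K,) class label of each atom
    """
    # sparse code of x over the whole dictionary (Eq. (14))
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    code = lasso.fit(D, x).coef_
    # residual of each class-specific sub-dictionary (Eq. (15))
    residuals = {}
    for c in np.unique(atom_class):
        idx = atom_class == c
        residuals[c] = np.linalg.norm(x - D[:, idx] @ code[idx])
    return min(residuals, key=residuals.get)
```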

4. Results and discussion

4.1. Materials

We evaluate our method on the publicly available MITOS dataset for mitosis detection in breast cancer histopathology images [8]. The dataset includes 50 images corresponding to high-power fields (HPFs) in five different biopsy slides stained with H&E. More than 325 MCs are visible in the MITOS dataset. Each HPF represents a 512 × 512 μm² area, and is acquired using three different setups: two slide scanners and a multispectral microscope. Here we consider images acquired by the two slide scanners: Aperio XT and Hamamatsu. Aperio HPFs have a resolution of 0.2456 μm per pixel, resulting in a 2084 × 2084 RGB image. Hamamatsu HPFs have a horizontal and vertical resolution of 0.2273 and 0.22753 μm per pixel, resulting in a 2252 × 2250 RGB image. Expert pathologists manually annotated all visible MCs.

4.2. Preprocessing

4.2.1. Candidate extraction

Before performing candidate detection, we apply stain normalisation [32] as a preprocessing step to neutralise the variation in the visual appearance of tissue sections. Next, we segment potential mitosis candidates by first applying Gaussian blurring to the red channel of the image to remove noise, followed by thresholding with a globally fixed threshold. We use free-response receiver operating characteristic curve analysis to select a suitable value for the threshold. After segmentation, all connected regions with an area between 10 μm² and 100 μm² are considered as candidate objects. This size range is estimated by examining the size distribution of the annotated MCs in the training set; regions outside it are highly unlikely to correspond to mitotic figures or parts of them. Lastly, if the centroids of two candidate MCs are less than 4.5 μm (20 pixels) apart, one of the two candidates is removed.
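The candidate-extraction steps above can be sketched roughly as follows; this is our illustration rather than the authors' code, and the threshold, blur width and pixel size are placeholders for the values selected via the FROC analysis and the scanner resolution.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.measure import label, regionprops

def extract_candidates(rgb, threshold=0.35, sigma=2.0, um_per_px=0.2456):
    """Return centroids (row, col) of candidate mitotic cells in an H&E HPF image.

    rgb : (H, W, 3) float image in [0, 1], assumed already stain-normalised.
    """
    # Gaussian blurring of the red channel followed by a globally fixed threshold;
    # chromatin-rich (mitotic) figures are dark, so keep the low-intensity regions
    red = gaussian_filter(rgb[:, :, 0], sigma=sigma)
    mask = red < threshold
    # keep connected regions whose area lies between 10 and 100 square micrometres
    px_area = um_per_px ** 2
    centroids = []
    for region in regionprops(label(mask)):
        if 10.0 <= region.area * px_area <= 100.0:
            centroids.append(region.centroid)
    # if two candidates are closer than 4.5 micrometres, drop one of them
    kept = []
    for c in centroids:
        if all(np.hypot(c[0] - k[0], c[1] - k[1]) * um_per_px >= 4.5 for k in kept):
            kept.append(c)
    return kept
```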


Most of the MCs in the MITOS dataset fall in the range of 30 × 30 to 60 × 60 pixels. In order to incorporate the surrounding context along with the candidate MCs, we choose a patch size of 81 × 81 pixels (i.e. a margin of about 10 pixels on each side) as a trade-off between the amount of context and the computational burden. However, since we perform nuclei alignment (described in Section 4.2.2), which may involve rotation of candidate patches, an extra padding of 20 pixels is added on all sides of the candidate patches to compensate for this rotation. Therefore, patches of size 121 × 121 pixels around the centroids of the candidate MCs are extracted during candidate detection, and after nuclei alignment (Section 4.2.2) only the central 81 × 81 part is used in the subsequent steps. Fig. 1 shows some of the candidates extracted from images obtained with the Aperio scanner using the procedure outlined above. Note that in all experiments, parameters tuned for images from the Aperio scanner were used as is on images from the Hamamatsu scanner.

4.2.2. Nuclear alignment

Intensity features are neither translation-invariant nor rotation-invariant. As indicated by [33–35], the classical dictionary learning framework is sensitive to the translations and rotations inherent in the dataset. Without nuclei alignment, some atoms in the learnt dictionary may just be simple translations of each other. Therefore, translation and rotation of patches is carried out to ensure that all the candidate MCs are aligned at the centre. The candidate patch extraction step results in patches having a nucleus at the centre, but the nuclei are not consistently oriented. We rotate patches using principal component analysis (PCA). First, we perform nuclear segmentation on the blue ratio intensity [36]:

$$BR = 100 \cdot \frac{B}{1 + R + G} \cdot \frac{256}{1 + R + G + B}. \quad (16)$$
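A small sketch of the blue-ratio computation in (16) and of the PCA-based orientation estimate described next is given below; it is our illustration rather than the authors' code, and the segmentation rule and the selection of the central nucleus are simplified assumptions.

```python
import numpy as np

def blue_ratio(rgb):
    """Blue-ratio intensity of Eq. (16); rgb is an (H, W, 3) array of R, G, B values."""
    R = rgb[..., 0].astype(float)
    G = rgb[..., 1].astype(float)
    B = rgb[..., 2].astype(float)
    return 100.0 * (B / (1.0 + R + G)) * (256.0 / (1.0 + R + G + B))

def nuclear_orientation(br_patch, threshold=None):
    """Estimate the dominant orientation (radians) of the nuclear pixels in a BR patch
    by PCA on their coordinates; the thresholding rule below is a placeholder."""
    if threshold is None:
        threshold = br_patch.mean() + br_patch.std()
    rows, cols = np.nonzero(br_patch > threshold)
    coords = np.column_stack([rows, cols]).astype(float)
    coords -= coords.mean(axis=0)
    # principal component coefficients of the nuclear pixel coordinates
    _, _, vt = np.linalg.svd(coords, full_matrices=False)
    major = vt[0]
    return np.arctan2(major[0], major[1])
```

The returned angle could then be used with, e.g., scipy.ndimage.rotate (which expects degrees) before cropping the central 81 × 81 region of the patch.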

The blue ratio intensity indicates the spatial distribution of nuclear content in a patch. Only the nucleus whose centroid is closest to the centre of the patch is considered. PCA is then performed on the coordinates of the nuclear pixels, yielding a rotation matrix in the form of principal component coefficients. The patch is rotated around its centre point in the counter-clockwise direction. Finally, patches are cropped around their centre points to a size of 81 × 81 pixels.

4.3. Experimental setup

The main focus of this paper is modelling the appearance of cells; therefore, after performing candidate detection (as outlined in Section 4.2.1), we divide all the candidate patches into two sets: $S_{train}$, which contains roughly 70% of the candidate patches, and $S_{test}$, which contains roughly 30%. Blue ratio intensities of the candidate patches are used as inputs to the proposed DDL method. To make the input data consistent with the dictionary optimisation procedure, which does not constrain the coefficients of atoms to be non-negative, all the data are linearly transformed to have zero median. We randomly split the patches in $S_{train}$ into $X_{train}$ (70%) and $X_{validate}$ (30%). $X_{train}$ is used in the training procedure of the proposed DDL method, while $X_{validate}$ is used to monitor the classification performance at each iteration and to avoid over-fitting. A stopping criterion for the algorithm is that the relative difference in the cost of the objective function over consecutive iterations, i.e.

$$\frac{|cost_{t+1} - cost_t|}{|cost_t - cost_{t-1}|}, \quad (17)$$

is less than $10^{-6}$. We select the dictionary from the iteration that gives the highest classification performance on $X_{validate}$ and perform classification of the patches from $S_{test}$.

The degree of sparseness of a sparse code $\alpha$ of length $n$ can be measured by [34]

$$Sp(\alpha) = \frac{\sqrt{n} - \|\alpha\|_1 / \|\alpha\|_2}{\sqrt{n} - 1}, \quad (18)$$

where $\|\cdot\|_1$ and $\|\cdot\|_2$ denote the $l_1$-norm and $l_2$-norm, respectively. This measure ranges between 0 and 1, and the sparser the vector, the higher the measure. We manually select the regularisation parameter $\lambda$ for training and for classification such that a certain level of average sparseness is attained when the algorithm converges. In the experiments, we consider two levels of sparseness, namely 0.7 and 0.8. For the setup with a sparseness level of 0.7, we set the parameter $L$ in the sparse code initialisation step to 30% of the total number of dictionary atoms, $K$. We set $L$ to 20% of $K$ for the setup with sparseness 0.8. In all experiments, the parameter $\gamma$ in (2) is set to 5.

4.4. Evaluation

In order to reduce the effect of the random partition of $S_{train}$ into $X_{train}$ and $X_{validate}$, we run each experiment for 10 repetitions. For each repetition $i$, we generate the set of detections for $S_{test}$ and count the number $N_{tp,i}$ of true positives (i.e. detections whose centroids are closer than 8 μm to a ground truth centroid), false positives ($N_{fp,i}$) and false negatives ($N_{fn,i}$). The total numbers of true positives, false negatives and false positives are given by $N_{tp} = \sum_{i=1}^{10} N_{tp,i}$, $N_{fn} = \sum_{i=1}^{10} N_{fn,i}$, and $N_{fp} = \sum_{i=1}^{10} N_{fp,i}$, respectively. The following performance measures are calculated: recall $R = N_{tp}/(N_{tp} + N_{fn})$, precision $P = N_{tp}/(N_{tp} + N_{fp})$ and F1-score $F1 = 2PR/(P + R)$.
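For concreteness, the sparseness measure of Eq. (18) and the aggregated precision/recall/F1 computation can be written as follows (our sketch, not the authors' evaluation code).

```python
import numpy as np

def sparseness(alpha):
    """Sp(alpha) of Eq. (18): 0 for a vector with equal-magnitude entries, 1 for a 1-sparse vector."""
    n = alpha.size
    return (np.sqrt(n) - np.abs(alpha).sum() / np.linalg.norm(alpha)) / (np.sqrt(n) - 1.0)

def prf1(n_tp, n_fp, n_fn):
    """Precision, recall and F1-score from counts aggregated over the 10 repetitions."""
    precision = n_tp / (n_tp + n_fp)
    recall = n_tp / (n_tp + n_fn)
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1
```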

4.5. Experimental results and discussion

Fig. 3 illustrates examples of learnt cell words from the mitotic and non-mitotic classes. It is apparent that the appearance of the cells as well as the surrounding texture is captured by the learnt cell words. Some of the mitotic cell words resemble the appearance of cells undergoing different phases of the cell division process. Although the cell words from the mitotic and non-mitotic classes were learnt from BR intensities, where nuclear structures appear bright and everything else dark, the images shown in Fig. 3 exhibit different patterns of nuclei and cytoplasm regions. This is mainly because of the dictionary update mechanism, which does not constrain the atoms to be strictly positive. As explained in Section 3.1, reconstruction of these cells is performed by a linear combination of the atoms; therefore, the combination of the atoms is important, not the individual atoms themselves. In Fig. 3, atoms with white cytoplasm and dark nuclei act as negative terms in a linear combination.

Fig. 4 shows some examples of MCs that were either undetected (false negatives) or falsely detected (false positives). Notice that the false positives are very close in appearance to true MCs, while the false negatives are either very weakly stained, situated at the corner of the image where the surrounding texture is partially missing, or have a shape that is less prevalent in the given dataset. In the following subsections, we present a range of experiments to evaluate the performance of the proposed algorithm.

4.5.1. Experiment 1: effect of the algorithm's parameters on the classification performance of the proposed algorithm

The two main parameters of the proposed DDL algorithm are (1) the sparseness of the reconstruction code, controlled by the parameter $\lambda$, and (2) the number of dictionary atoms per class, given by $k_c$. In a reconstruction problem, the higher the number of atoms per class, the better the variation in the data can be captured, which leads to a lower reconstruction residual.


Fig. 3. (a) Snapshot of 40 mitotic cell words found using the proposed method; (b) 40 non-mitotic cell words found using the same approach. Note that the mitotic cell words in (a) are representative of various phases of the cell division life cycle: for example, the 7th image in the 1st row, the 3rd image in the 3rd row, and the 1st image in the last row noticeably mimic metaphase, prophase, and anaphase, respectively. The greyscale images in (a) and (b) represent the blue ratio intensities.


Fig. 4. (Top row) Snapshot of six mitotic cells undetected by the proposed method (false negatives). (Bottom row) Snapshot of six non-mitotic cells misdetected by the proposed method (false positives). Note that false negatives are usually cells that are either very weakly stained, situated at the corner of the image where the surrounding context is partly missing, or have an appearance that is less prevalent in the dataset. Also note that the false positives have an appearance very similar to that of true mitotic cells.

The level of sparseness, on the other hand, is more related to the compactness of the representation of the data; high sparsity means that very few dictionary atoms are used to represent a given signal. Fig. 5 shows the plot of F1-scores against the number of dictionary atoms per class for the two sparseness settings. The results are summarised over 10 repetitions. The F1-score has a tendency to increase as the number of atoms per class increases. However, the increase in performance is insignificant compared to the increase in training time. We observed that by introducing more atoms, variations in cell appearance were better captured, but at the same time atoms from different classes became increasingly redundant, which eventually reduces the discriminative power of the algorithm. In addition, the algorithm is more likely to get stuck in a local minimum when the number of atoms increases. This is quite clear in the case of 60 and 100 atoms, where the performance does not improve even after significantly increasing the number of atoms. The same phenomenon can be observed in Table 1, where the performance drops when the number of atoms in the non-mitotic class is increased from 80 to 100. We would also like to emphasise that the scores reported in Fig. 5 are F1-scores, which are quite sensitive to the number of false positives, especially considering that the number of true positives in the dataset is very limited compared to the number of false positives.


Table 1
Effect of number of atoms in mitotic and non-mitotic classes on classification performance of the proposed algorithm.

kMC, knMC    Precision   Recall    F1-score
40, 20       0.7645      0.8861    0.8208
40, 40       0.8437      0.842     0.8428
40, 60       0.8498      0.8718    0.8607
40, 80       0.8765      0.788     0.8299
40, 100      0.8458      0.757     0.7989

In terms of sparseness, the higher the sparseness of the sparse codes, the fewer atoms are selected in the reconstruction of the data. This relates to the concept of locality of representation [37–39], where only a few atoms in the neighbourhood of a data point should be used to represent that data point. It also pushes atoms closer to the data points they represent during the dictionary optimisation process. As can be seen, the F1-scores at sparseness 0.8 are often better than those at sparseness 0.7.

Table 1 presents the classification performance when the number of atoms is unbalanced. Here, we fix the number of atoms of the mitotic class and vary the number of atoms in the non-mitotic class. The reported results are summarised over 10 repetitions. The important observation here is that the precision scores tend to increase as the number of atoms in the non-mitotic class increases, whereas the recall scores exhibit the opposite behaviour. This confirms that the variation in the appearance of non-mitotic cells is better captured when the number of atoms increases, so the number of false positives decreases. However, the non-mitotic atoms also become more similar to the mitotic atoms, which results in an increase in the number of false negatives.


number of atoms / class Fig. 5. Classification performance based on the learnt dictionary.

4.5.2. Experiment 2: effect of nuclei alignment on the classification performance of the proposed algorithm

Here, we demonstrate the effect of nuclei alignment on the classification accuracy of the proposed system. We randomly split the data into training (70%) and test (30%) sets, choose the optimum set of parameters for the algorithm (number of atoms [40, 40] and sparseness 0.8) and perform discriminative dictionary learning on both nuclei-aligned and unaligned image patches. The learnt dictionaries are later used for classification of the test set. The process is repeated 10 times and the average results on the test sets are reported in Table 2. As can be seen clearly from the results, we obtain an improvement of 6.42% in terms of F1-score when the nuclei-aligned set of image patches is used.

Table 2
Effect of nuclei alignment on classification performance of the proposed system. Last column presents the p-value for the Wilcoxon rank-sum test between the no-nuclei-alignment and nuclei-aligned configurations.

                        Precision   Recall   F1-score   p-value
No nuclei alignment     0.7904      0.792    0.7912     –
Nuclei alignment        0.8437      0.842    0.8428     8.1 × 10−5

Table 4
Results of discriminative dictionary learning on data from two different scanners using the same set of parameters (i.e. [KMC, KnMC] = [40, 40] and sparseness = 0.8). Average results over 10 random splits of training and test data are reported.

Scanner      Precision   Recall   F1-score
Aperio       0.8255      0.7854   0.805
Hamamatsu    0.8284      0.7359   0.7794

In order to investigate the significance of this improvement, we perform the Wilcoxon rank-sum test of statistical significance, which tests for a change in the median of two populations at a given significance level. With the null hypothesis that there is no change in the median of the F1-scores between the two experimental settings (aligned and unaligned nuclei) and the alternative hypothesis that the median of the F1-scores from the aligned-nuclei image patches is greater than its counterpart, the test indicates a statistically significant improvement in classification performance at significance level 0.01 (p-value = 8.1 × 10−5).

4.5.3. Experiment 3: evaluation of the proposed algorithm for different sizes of training and test data

To demonstrate the ability of the proposed classification algorithm for different sizes of training and test partitions, we randomly split the data into training/test sets using three different configurations (70/30, 50/50 and 30/70), where, for example, 70/30 denotes a configuration in which 70% of the data is used for training and 30% for testing. Next, we choose the optimum set of parameters for the algorithm (number of atoms [40, 40] and sparseness 0.8) and perform discriminative dictionary learning for each configuration independently. The learnt dictionaries are later used for classification of the test set. The process is repeated 10 times for each configuration and the average results on the test sets are reported in Table 3. As can be seen from the results, on average we obtain an improvement in F1-score of 2.89% and 4.68% when the 70/30 configuration is used instead of 50/50 and 30/70, respectively. In order to investigate the significance of this improvement, we again perform the Wilcoxon rank-sum test. With the null hypothesis that there is no change in the median of the F1-scores between the two experimental settings (70/30 vs. 50/50) and the alternative hypothesis that the median of the F1-scores from the 70/30 configuration is greater than its counterpart (50/50), the test indicates a statistically insignificant improvement in classification performance at significance level 0.01 (p-value = 0.3223). Similarly, with the null hypothesis that there is no change in the median of the F1-scores between the two experimental settings (50/50 vs. 30/70) and the alternative hypothesis that the median of the F1-scores from the 50/50 configuration is greater than its counterpart (30/70), the test indicates a statistically insignificant improvement in classification performance at significance level 0.01 (p-value = 0.0149).
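The one-sided rank-sum test used in these experiments can be reproduced with SciPy along the following lines; this is our sketch rather than the authors' analysis script, the per-repetition F1 arrays are placeholders, and it assumes a SciPy version whose ranksums accepts the alternative argument.

```python
import numpy as np
from scipy.stats import ranksums

# placeholder per-repetition F1-scores for two experimental settings
f1_aligned = np.array([0.85, 0.84, 0.83, 0.86, 0.84, 0.85, 0.83, 0.84, 0.85, 0.84])
f1_unaligned = np.array([0.79, 0.80, 0.78, 0.79, 0.80, 0.79, 0.78, 0.80, 0.79, 0.79])

# H0: equal medians; H1: the aligned-nuclei F1-scores have a greater median
stat, p_value = ranksums(f1_aligned, f1_unaligned, alternative='greater')
print(f'p-value = {p_value:.2e}; significant at 0.01: {p_value < 0.01}')
```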

Table 3
Performance of the proposed algorithm on training/test datasets of various sizes: each row presents the classification performance measures for a different training/test split. Last column presents the p-value for the Wilcoxon rank-sum test between 70/30 and the other configurations. Note that the p-value for the test between the 70/30 and 30/70 configurations shows a statistically significant improvement at significance level 0.01, which is expected as increased training data may help the classification model to learn less prevalent cases.

[Train, test] split   Precision   Recall   F1-score   p-value
[70, 30]              0.8437      0.842    0.8428     –
[50, 50]              0.82        0.8418   0.8308     0.3223
[30, 70]              0.7887      0.8221   0.8051     0.0045

4.5.4. Experiment 4: evaluation of the proposed algorithm as per the MITOS2012 criterion

For this experiment only, we choose $S_{train}$ and $S_{test}$ such that they consist of the patches obtained from the training and evaluation HPFs provided by the ICPR2012 Mitosis Detection Contest: $S_{train}$ consists of patches obtained from the 35 training images, while $S_{test}$ consists of patches obtained from the 15 evaluation images. For details about the list of training and test images, the reader is referred to the contest website.² We choose the same set of parameters for the algorithm (number of atoms [40, 40] and sparseness 0.8) as used in Sections 4.5.2 and 4.5.3 to learn discriminative dictionaries. The learnt dictionaries are later used for classification of the test set. The process is repeated 10 times and the average results on the test sets are reported in Table 4. The experimental results suggest that the algorithm performs better on data from the Aperio scanner than on data from the Hamamatsu scanner, mainly because all of the parameters were tuned on data from the Aperio scanner and used as is on data from the Hamamatsu scanner. Nevertheless, since the algorithm uses BR features, which are more likely to remain consistent regardless of the scanner, the difference in performance between the two scanners is relatively small (3.2%).

5. Conclusions

We presented an efficient paradigm for modelling the visual appearance of cells in histopathological images based on a discriminative dictionary learning method in which the within-class reconstruction error is minimised and the between-class reconstruction error is maximised. The model essentially incorporates various characteristics of the image, including colour, shape, texture and context, in a unified manner. We used the proposed framework to model the visual appearance of mitotic cells, which can easily be confused with other constructs present in histopathological tissue. Potential future directions include extending the same paradigm to the detection of other types of cells (e.g. lymphocytes) stained using standard H&E or immunohistochemical stains in histopathological images [40].

Acknowledgements

Research reported in this publication is partially supported by the Qatar National Research Fund (QNRF) under award number NPRP 5-1345-1-228 and Qatar University Startup Grant QUSG-CENG-CSE-12/13-2. A.M. Khan acknowledges the financial support provided by the Warwick Postgraduate Research Scholarship (WPRS) programme and the Department of Computer Science at the University of Warwick, UK. K. Sirinukunwattana acknowledges the financial support provided by the Department of Computer Science, University of Warwick, UK.

² http://ipal.cnrs.fr/ICPR2012/


References

[1] Alexe G, Dalgin GS, Scanfeld D, Tamayo P, Mesirov JP, DeLisi C, et al. High expression of lymphocyte-associated genes in node-negative HER2+ breast cancers correlates with lower recurrence rates. Cancer Res 2007;67(22):10669–76.
[2] Elston C, Ellis I. Pathological prognostic factors in breast cancer: I. The value of histological grade in breast cancer: experience from a large study with long-term follow-up. Histopathology 1991;19(5):403–10.
[3] Stierer M, Rosen H, Weber R. Nuclear pleomorphism, a strong prognostic factor in axillary node-negative small invasive breast cancer. Breast Cancer Res Treat 1991;20(2):109–16.
[4] Aaltomaa S, Lipponen P, Eskelinen M, Kosma V-M, Marin S, Alhava E, et al. Mitotic indexes as prognostic predictors in female breast cancer. J Cancer Res Clin Oncol 1992;118(1):75–81.
[5] Irshad H, Veillard A, Roux L, Racoceanu D. Methods for nuclei detection, segmentation and classification in digital histopathology: a review – current status and future potential. IEEE Trans Biomed Eng 2014;99(99), 9999-9999.
[6] Khan AM, ElDaly H, Rajpoot NM. A gamma-Gaussian mixture model for detection of mitotic cells in breast cancer histopathology images. J Pathol Inform 2013;4:11.
[7] Moon TK. The expectation-maximization algorithm. IEEE Signal Process Mag 1996;13(6):47–60.
[8] Ludovic R, Daniel R, Nicolas L, Maria K, Humayun I, Jacques K, et al. Mitosis detection in breast cancer histological images: an ICPR 2012 contest. J Pathol Inform 2013;4(1):8.
[9] Veta M, Diest PJV, Willems SM, Wang H, Madabhushi A, Cruz-Roa A, et al. Assessment of algorithms for mitosis detection in breast cancer histopathology images. Med Image Anal 2014, http://dx.doi.org/10.1016/j.media.2014.11.010 [available online at: http://www.sciencedirect.com/science/article/pii/S1361841514001807].
[10] Irshad H. Automated mitosis detection in histopathology using morphological and multi-channel statistics features. J Pathol Inform 2013;4.
[11] Veta M, van Diest P, Pluim J. Detecting mitotic figures in breast cancer histopathology images. In: SPIE medical imaging, International Society for Optics and Photonics. 2013, 867607-867607.
[12] Cireşan DC, Giusti A, Gambardella LM, Schmidhuber J. Mitosis detection in breast cancer histology images with deep neural networks. In: Medical image computing and computer-assisted intervention – MICCAI 2013. Springer; 2013. p. 411–8.
[13] Tashk A, Helfroush MS, Danyali H, Akbarzadeh M. An automatic mitosis detection method for breast cancer histopathology slide images based on objective and pixel-wise textural features classification. In: 2013 IEEE 5th conference on information and knowledge technology (IKT). 2013. p. 406–10.
[14] Khan AM, El-Daly H, Rajpoot N. RanPEC: random projections with ensemble clustering for segmentation of tumor areas in breast histology images. In: Medical image understanding and analysis (MIUA). Swansea, UK: British Machine Vision Association (BMVA); 2012. p. 17–23.
[15] Khan AM, El-Daly H, Simmons E, Rajpoot NM. HyMaP: a hybrid magnitude-phase approach to unsupervised segmentation of tumor areas in breast cancer histology images. J Pathol Inform 2013;4(Suppl.).
[16] Zhang Q, Li B. Discriminative K-SVD for dictionary learning in face recognition. In: 2010 IEEE conference on computer vision and pattern recognition (CVPR). IEEE; 2010. p. 2691–8.
[17] Jiang Z, Lin Z, Davis LS. Learning a discriminative dictionary for sparse coding via label consistent K-SVD. In: 2011 IEEE conference on computer vision and pattern recognition (CVPR). IEEE; 2011. p. 1697–704.
[18] Ramirez I, Sprechmann P, Sapiro G. Classification and clustering via dictionary learning with structured incoherence and shared features. In: 2010 IEEE conference on computer vision and pattern recognition (CVPR). IEEE; 2010. p. 3501–8.
[19] Yang M, Zhang L, Feng X, Zhang D. Fisher discrimination dictionary learning for sparse representation. In: 2011 IEEE international conference on computer vision (ICCV). IEEE; 2011. p. 543–50.
[20] Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B: Meth 1996:267–88.
[21] Kim S-J, Koh K, Lustig M, Boyd S, Gorinevsky D. An interior-point method for large-scale l1-regularized least squares. IEEE J Sel Top Signal Process 2007;1(4):606–17.
[22] Wu TT, Lange K. Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat 2008:224–44.
[23] Daubechies I, Defrise M, De Mol C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm Pure Appl Math 2004;57(11):1413–57.
[24] Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Stat 2004;32(2):407–99.
[25] Drori I, Donoho DL. Solution of l1 minimization problems by LARS/homotopy methods. In: 2006 IEEE international conference on acoustics, speech and signal processing (ICASSP 2006), vol. 3. IEEE; 2006. III-III.
[26] Rosasco L, Verri A, Santoro M, Mosci S, Villa S. Iterative projection methods for structured sparsity regularization. Tech. Rep., MIT Technical Reports, MIT-CSAIL-TR-2009-050, CBCL-282; 2009.
[27] Bioucas-Dias J, Figueiredo M. A new TwIST: two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Trans Image Process 2007;16(12):2992–3004.
[28] Donoho DL. De-noising by soft-thresholding. IEEE Trans Inf Theory 1995;41(3):613–27.
[29] Mairal J, Bach F, Ponce J, Sapiro G. Online dictionary learning for sparse coding. In: Proceedings of the 26th annual international conference on machine learning. ACM; 2009. p. 689–96.
[30] Bertsekas DP. Nonlinear programming. Athena Scientific; 1999.
[31] Pati YC, Rezaiifar R, Krishnaprasad P. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In: 1993 Conference record of the twenty-seventh Asilomar conference on signals, systems and computers. IEEE; 1993. p. 40–4.
[32] Khan AM, Rajpoot N, Treanor D, Magee D. A non-linear mapping approach to stain normalisation in digital histopathology images using image-specific colour deconvolution. IEEE Trans Biomed Eng 2014;61(6):1729–38.
[33] Li SZ, Hou X, Zhang H, Cheng Q. Learning spatially localized, parts-based representation. In: Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition (CVPR 2001), vol. 1. IEEE; 2001. p. 1–207.
[34] Hoyer PO. Non-negative matrix factorization with sparseness constraints. J Mach Learn Res 2004;5:1457–69.
[35] Thiagarajan JJ, Ramamurthy KN, Spanias A. Shift-invariant sparse representation of images using learned dictionaries. In: IEEE workshop on machine learning for signal processing (MLSP 2008). IEEE; 2008. p. 145–50.
[36] Chang H, Loss LA, Parvin B. Nuclear segmentation in H and E sections via multi-reference graph-cut (MRGC). In: International symposium on biomedical imaging (ISBI). 2012.
[37] Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science 2000;290(5500):2323–6.
[38] Zhou Y, Barner K. Locality constrained dictionary learning for nonlinear dimensionality reduction. IEEE Signal Process Lett 2013;20(4):335–8.
[39] Wang J, Yang J, Yu K, Lv F, Huang T, Gong Y. Locality-constrained linear coding for image classification. In: 2010 IEEE conference on computer vision and pattern recognition (CVPR). IEEE; 2010. p. 3360–7.
[40] Khan AM, Mohammed AF, Al-Hajri SA, Shamari HMA, Qidwai U, Mujeeb I, et al. A novel system for scoring of hormone receptors in breast cancer histopathology slides. In: 2014 Middle East conference on biomedical engineering (MECBME). IEEE; 2014. p. 155–8.
[41] Aloraidi NA, Sirinukunwattana K, Khan AM, Rajpoot NM. On generating cell exemplars for detection of mitotic cells in breast cancer histopathology images. In: Engineering in Medicine and Biology Society (EMBC), 2014 36th annual international conference of the IEEE. IEEE; 2014. p. 3370–3.
