Transfer Learning for Bag-of-Visual-Words Approach to NBI Endoscopic Image Classification

Shoji Sonoyama1, Tsubasa Hirakawa1, Toru Tamaki1, Takio Kurita1, Bisser Raytchev1, Kazufumi Kaneda1, Tetsushi Koide2, Shigeto Yoshida3, Yoko Kominami4, Shinji Tanaka4

Abstract— We address the problem of classifying endoscopic images taken by different (e.g., old and new) endoscope systems. Our proposed method formulates the problem as a constraint optimization that estimates a linear transformation between feature vectors (Bag-of-Visual-Words histograms) in a framework of transfer learning. Experimental results show that the proposed method works much better than the case without feature transformation.

*This work was supported in part by JSPS KAKENHI Grant Numbers 14J00223, 30243596, and 00335689, the Mazda Foundation grant number 13KK-210, and the Semiconductor Technology Academic Research Center (STARC).
1 Department of Information Engineering, Graduate School of Engineering, Hiroshima University, 1-4-1 Kagamiyama, Higashi-Hiroshima, Hiroshima 739-8527, Japan
2 Research Institute for Nanodevice and Bio Systems (RNBS), Hiroshima University, 1-4-2 Kagamiyama, Higashi-Hiroshima 739-8527, Japan
3 Department of Gastroenterology, Hiroshima General Hospital of West Japan Railway Company, 3-1-36 Futabanosato, Higashi-ku, Hiroshima 732-0057, Japan
4 Department of Endoscopy, Hiroshima University Hospital, 1-2-3 Kasumi, Minami-ku, Hiroshima 734-8551, Japan

I. INTRODUCTION

Colorectal cancer is one of the major causes of cancer death all over the world [1]. Among the many inspection methods for colorectal cancer, endoscopic examination, called colorectal endoscopy or colonoscopy, using a Narrow Band Imaging (NBI) system is widely used to diagnose colorectal cancer [2]. During colonoscopy, endoscopists observe a tumor displayed on a monitor to decide whether treatment and resection of the tumor are necessary. As a diagnostic criterion, the NBI magnification findings have been proposed by Hiroshima University Hospital [3], [4], which categorize the appearance of tumors into types A, B, and C; type C is further subclassified into C1, C2, and C3 based on microvessel structures (see Figure 1).

Meanwhile, computer-aided diagnosis systems for colonoscopy based on the appearance of tumors have been widely studied because they would be helpful to support diagnosis during examinations. Therefore, many patch-based classification methods for endoscopic images have been proposed over the last decade, for pit pattern images [5], [6], [7] and for NBI endoscopic images [8], [9], [10], [11].

The problem addressed by this paper is the inconsistency between training and testing images. Training of the classifiers mentioned above assumes that the distributions of features extracted from the training and testing image datasets are the same, as do other common machine learning approaches. However, different endoscope systems may be used for the training and testing datasets. One reason is the rapid development of medical devices (endoscope systems in our case): hospitals may introduce new endoscopes at a certain point in time after the training images were taken. Another reason is that a dataset may be constructed for training classifiers in a hospital with a certain type of endoscope, while another hospital could use the classifiers with a different endoscope. In general, such inconsistency can lead to a deterioration of classification performance. Collecting new images for a new training dataset might be a possible solution, but this is impractical for our medical images: it is infeasible to collect a large enough set of images for all types and manufacturers of endoscopes.

Figure 2 shows an example of the difference in the appearance of texture captured by different endoscope systems. These images show the same scene of a printed sheet of a colorectal polyp taken by the different systems at almost the same distance from the endoscopes to the sheet. Even for the same manufacturer (Olympus) and the same modality (NBI), the images differ from each other in resolution, image quality and sharpness, brightness, viewing angle, and so on. This kind of difference may affect classification performance.

As a first attempt to tackle the problem, we propose a method for estimating the transformation between features of training and testing datasets taken by different (old and new) devices. Since we follow the work [11] that uses the Bag-of-Visual-Words (BoVW) framework, the features to be transformed are histograms of visual words. We formulate the problem as a constraint optimization, and develop an algorithm to estimate the transformation using the Alternating Direction Method of Multipliers (ADMM) [12].

II. RELATED WORK

A. Classification method

Here we briefly review the classification method proposed by Tamaki et al. [13], which uses the Bag-of-Visual-Words (BoVW) framework with Scale-Invariant Feature Transform (SIFT) features followed by Support Vector Machine (SVM) classifiers in order to classify images into types A, B, and C3. First, local features (SIFT) s_{n,r} (r = 1, ..., R_n) are extracted from training images I_n (n = 1, ..., N). Next, k-means clustering is performed on these features to obtain cluster centers as visual words V_k (k = 1, ..., K). Then, these visual words are used for vector quantization of the local features of each training image in order to construct a visual word histogram x_n of I_n. The histograms of the training images are then learned by an SVM classifier. For test images, the same visual words are used to make a histogram of each test image.
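As a rough illustration of this pipeline (a sketch of the general BoVW recipe, not the authors' implementation), the steps can be written with OpenCV and scikit-learn as below; the value of K, the plain SVC settings, and the helper names are our own choices, and cv2.SIFT_create assumes an OpenCV build with SIFT support.

```python
# Illustrative Bag-of-Visual-Words pipeline (not the authors' code).
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def extract_sift(gray_images):
    """Extract SIFT descriptors s_{n,r} from each grayscale image I_n."""
    sift = cv2.SIFT_create()
    descs = []
    for img in gray_images:
        _, d = sift.detectAndCompute(img, None)
        descs.append(d if d is not None else np.empty((0, 128), np.float32))
    return descs

def bovw_histogram(desc, kmeans):
    """Vector-quantize one image's descriptors against the visual words V_k."""
    if len(desc) == 0:
        return np.zeros(kmeans.n_clusters)
    words = kmeans.predict(desc)
    return np.bincount(words, minlength=kmeans.n_clusters).astype(float)

def train_bovw_classifier(train_images, train_labels, K=96):
    """Learn visual words by k-means, build histograms x_n, fit an SVM."""
    descs = extract_sift(train_images)
    kmeans = KMeans(n_clusters=K).fit(np.vstack(descs))
    X = np.array([bovw_histogram(d, kmeans) for d in descs])
    return kmeans, SVC().fit(X, train_labels)

def classify(test_images, kmeans, clf):
    """Test images are quantized with the SAME visual words, then classified."""
    X = np.array([bovw_histogram(d, kmeans) for d in extract_sift(test_images)])
    return clf.predict(X)
```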


Fig. 1. NBI magnification findings [3]:
Type A: Microvessels are not observed or are extremely opaque.
Type B: Fine microvessels are observed around pits, and clear pits can be observed via the nest of microvessels.
Type C1: Microvessels comprise an irregular network, pits observed via the microvessels are slightly non-distinct, and vessel diameter or distribution is homogeneous.
Type C2: Microvessels comprise an irregular network, pits observed via the microvessels are irregular, and vessel diameter or distribution is heterogeneous.
Type C3: Pits via the microvessels are invisible, irregular vessel diameter is thick, or the vessel distribution is heterogeneous, and avascular areas are observed.

Fig. 3. Outline of the proposed method. Target samples are transformed by a matrix W.

Fig. 2. An example of the appearance difference between endoscope systems. (a) An image taken by an older system (video system center: Olympus EVIS LUCERA CV-260; endoscope: Olympus EVIS LUCERA CF-H260AZL/I). (b) An image of the same scene taken by a newer system (video system center: Olympus EVIS LUCERA ELITE CV-290; endoscope: Olympus EVIS LUCERA ELITE CF-HQ290ZL/I).

A straightforward way to handle the difference between training and test sets might be to ignore the difference and simply apply the classifier trained on the training set to the test set. Another possible solution might be to use both old and new images as the training set. This may work only if the distributions of the new and old images are consistent, which is not expected in general. Therefore we extend the method above in a framework of transfer learning, described next.

B. Transfer learning

In transfer learning, "one set of tasks is used to bias learning and improve performance on another task" [14]. The first task is called the source, and the second the target. In our case, the source task is the recognition of older endoscopic images I_n^s (n = 1, ..., N), and the target task is the recognition of newer endoscopic images I_m^t (m = 1, ..., M). This kind of transfer learning is categorized as domain adaptation [15]: there are two different domains, source and target, while the task is the same.

An attempt similar to our problem has been made by Saenko et al. [16] for recognizing both source and target images (of cups, keyboards, etc.). They formulate the problem as metric learning [17], which estimates the matrix W of the Mahalanobis distance \|x_i - y_j\|_W = (x_i - y_j)^T W (x_i - y_j) between local features x_i from source images and y_j from target images by solving the following constraint optimization:

\min_W \; \mathrm{tr}(W) - \log \det W
\text{s.t.} \; W \succeq 0,
\quad \|x_i - y_j\|_W \le u, \; (x_i, y_j) \text{ in the same class},   (1)
\quad \|x_i - y_j\|_W \ge l, \; (x_i, y_j) \text{ in different classes},

where u is the upper limit of the Mahalanobis distance between samples in the same class, and l is the lower limit of the Mahalanobis distance between samples in different classes. The estimated matrix is then decomposed as W = G^T G to transform samples by Gx_i and Gy_j, which are classified with a nearest-neighbor or SVM classifier. This method is not preferable for our purpose because it requires predefined parameters (u, l) to be tuned, and it tries to classify both source and target samples at the same time; in our study, we only need to classify the target samples.
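To make the notation concrete, the quadratic form \|x - y\|_W above and the factorization W = G^T G can be expressed in a few lines of NumPy. This is a generic illustration of these quantities, not Saenko et al.'s implementation; the eigendecomposition route is one common choice.

```python
import numpy as np

def mahalanobis_sq(x, y, W):
    """Quadratic form (x - y)^T W (x - y) for a positive semidefinite W."""
    d = x - y
    return float(d @ W @ d)

def decompose(W):
    """Factor a PSD matrix as W = G^T G via eigendecomposition, so that
    ||x - y||_W equals the squared Euclidean distance between Gx and Gy."""
    vals, vecs = np.linalg.eigh(W)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return np.diag(np.sqrt(vals)) @ vecs.T

# Example: after estimating W, samples are compared as Gx vs. Gy.
W = np.eye(3)
G = decompose(W)
x, y = np.ones(3), np.zeros(3)
assert np.isclose(mahalanobis_sq(x, y, W), np.sum((G @ x - G @ y) ** 2))
```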

III. PROPOSED METHOD

We describe our proposed method (outlined in Figure 3), which transforms target features (newer endoscopy) into source features (older endoscopy). Our method does not require any parameters to tune, and focuses on target classification only.

A. BoVW histogram transformation

In this study, we assume that each source image corresponds to one of the target images in order to estimate the transformation; that is, we have a BoVW histogram x_n of a source image I_n^s corresponding to the histogram y_n of a target image I_n^t. We further assume that the transformation between source and target features is linear, represented by a matrix W = [W_{ij}]. We now formulate the problem as the following least-squares optimization:

\min_W \; \frac{1}{2} \sum_{n=1}^{N} \|x_n - W y_n\|_2^2.   (2)

We need constraints on W to ensure that the transformed histogram W y_n is also a histogram; in other words, all of its elements must be nonnegative. Those elements are linear combinations of the elements of y_n, which are nonnegative; therefore the constraints to be imposed are that the weights of the


linear combinations must themselves be nonnegative. This results in the following constraint optimization problem:

\min_W \; \frac{1}{2} \|X - WY\|_F^2 \quad \text{s.t.} \quad W_{ij} \ge 0,   (3)

where X is a matrix whose columns are the x_n, and similarly for Y. Once W is estimated, we use both the source features x_n and the transformed target features W y_n as training samples. For a test sample y in the target domain, we apply W to y and then classify W y as before.

B. Optimization with ADMM

Here we develop an algorithm to solve the constraint problem (3) with the Alternating Direction Method of Multipliers (ADMM) [12]. By rewriting Eq. (3), we have

\min_{W, Z} \; \frac{1}{2} \|X - WY\|_F^2 + g(Z) \quad \text{s.t.} \quad W = Z,   (4)

where g(Z) is an indicator function whose value is zero when the constraints Z_{ij} \ge 0 are satisfied, and infinity otherwise. The augmented Lagrangian of Eq. (4) is given by

L(W, Z, U) = \frac{1}{2} \|X - WY\|_F^2 + \frac{\rho}{2} \|W - Z + U\|_F^2,   (5)

where U is a matrix of Lagrange multipliers, and \rho is a balancing parameter. Instead of directly solving Eq. (5), we propose to solve for each row w_k of W (k = 1, ..., K) separately for efficiency. In terms of the rows w_k, the augmented Lagrangian is written as follows:

L(W, Z, U) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} (x_{nk} - w_k^T y_n)^2 + \frac{\rho}{2} \sum_{k=1}^{K} \|w_k - z_k + u_k\|_2^2,   (6)

where x_{nk} is the k-th element of x_n \in \mathbb{R}^K, and z_k, u_k are the k-th rows of Z and U, respectively.

At each iteration h, we first fix z_k^h and u_k^h and solve for w_k^{h+1}. Taking the derivative, we have

\frac{\partial L}{\partial w_k} = \sum_{n=1}^{N} (y_n y_n^T w_k - x_{nk} y_n) + \rho (w_k - z_k^h + u_k^h) = 0,   (7)

and by rearranging it we have

w_k^{h+1} = \left( \sum_{n=1}^{N} y_n y_n^T + \rho E \right)^{-1} \left( \sum_{n=1}^{N} x_{nk} y_n + \rho (z_k^h - u_k^h) \right),   (8)

where E is the K x K identity matrix. Then we obtain

z_k^{h+1} = \Pi_c (w_k^{h+1} + u_k^h),   (9)
u_k^{h+1} = u_k^h + w_k^{h+1} - z_k^{h+1},   (10)

where \Pi_c(w) is the convex projection onto the feasible region where g(w) = 0; more precisely, it takes a vector and truncates all negative elements to zero. We repeat the three steps above to obtain w_k^{h+1}, z_k^{h+1}, and u_k^{h+1} in turn until convergence.
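Transcribed into NumPy, the three updates (8)-(10) take only a few lines. The sketch below is our own (the paper does not provide code); the values of rho, the iteration count, and the zero initialization are arbitrary choices, and all rows are updated at once since Eq. (8) shares the same matrix inverse across rows.

```python
import numpy as np

def estimate_W(X, Y, rho=1.0, iters=100):
    """Solve  min_W 0.5 * ||X - W Y||_F^2  s.t.  W_ij >= 0  by ADMM.
    X, Y: K x N matrices whose columns are paired histograms x_n, y_n."""
    K = X.shape[0]
    W = np.zeros((K, K)); Z = np.zeros((K, K)); U = np.zeros((K, K))
    A = Y @ Y.T + rho * np.eye(K)  # sum_n y_n y_n^T + rho E, shared by all rows
    B = X @ Y.T                    # row k is sum_n x_nk y_n^T
    for _ in range(iters):
        # w-update, Eq. (8), for all rows at once (A is symmetric):
        W = np.linalg.solve(A, (B + rho * (Z - U)).T).T
        # z-update, Eq. (9): the projection Pi_c truncates negatives to zero.
        Z = np.maximum(W + U, 0.0)
        # u-update, Eq. (10): dual update for the constraint W = Z.
        U = U + W - Z
    return Z  # Z is feasible (elementwise nonnegative)
```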


Fig. 4. Dataset used for experiments. (a) Examples of source images and (b) corresponding target images.

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS

We describe experimental results of a simulation in which a target dataset is generated from a corresponding source dataset. The source image dataset consists of 908 NBI colorectal endoscopic images (Type A: 359, Type B: 462, Type C3: 87) taken by the older endoscopy system (Figure 2(a)).1 We do not yet have a sufficient number of images of the different classes taken by the newer endoscopy system (Figure 2(b)); we therefore simulate the difference by applying image processing to the source dataset. Assuming that images taken by the newer endoscopy system are more contrast-enhanced and sharper, we apply contrast enhancement and unsharp mask filtering. Figures 4(a)(b) show examples of corresponding source and target images.
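The paper does not specify the exact processing parameters, so the following sketch only illustrates the kind of transformation meant here; the gain alpha, the Gaussian sigma, and the sharpening amount are hypothetical.

```python
import cv2
import numpy as np

def simulate_newer_system(img, alpha=1.3, sigma=2.0, amount=1.0):
    """Roughly emulate a more contrast-enhanced, sharper system: a linear
    contrast stretch followed by an unsharp mask. Parameters are hypothetical."""
    # Contrast enhancement: linear gain around the mean intensity.
    mean = img.mean()
    contrasted = np.clip(alpha * (img.astype(np.float32) - mean) + mean,
                         0, 255).astype(np.float32)
    # Unsharp masking: add back the difference from a Gaussian-blurred copy.
    blurred = cv2.GaussianBlur(contrasted, (0, 0), sigma)
    sharpened = np.clip(contrasted + amount * (contrasted - blurred), 0, 255)
    return sharpened.astype(np.uint8)
```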

We evaluate the following four different settings in terms of 10-fold cross validation (Figure 5).

a) Baseline: For comparison, we first perform an experiment without any transfer learning, using the source dataset only. In this case the source dataset is divided into 10 folds, giving a normal 10-fold cross validation on the source dataset.

b) Source only: The second setting also uses no transfer learning, but involves both the source and target datasets. Source images are used for training, and the resulting classifier is simply used to classify target (test) images. We divide the source and target datasets into 10 folds (each source fold corresponds to a target fold). Then 9 source folds are used for training, and the remaining target fold is used for testing.

c) Source+Target: The third setting does not use any transfer learning either; however, target images are also used for training, unlike the previous settings. We divide the source and target datasets into 10 folds; 18 folds (9 source and 9 target folds) are then used for training a classifier. The remaining target fold is used for testing by simply applying the trained classifier. Note that the number of training images is now doubled while the number of testing images remains the same.

1 The study was conducted with approval from the Hiroshima University Hospital ethics committee, and informed consent was obtained from the patients and/or family members for the endoscopic examination.


Fig. 5. Four different experimental settings.

Fig. 6. Performance (recognition rate) with different experimental settings over different numbers of visual words (feature dimensions). Best viewed in color.

d) Proposed method (1 and 2): The last setting is our proposed method, in two variations. In both, we use 9 source and 9 target folds for estimating the transformation W. In the first variation (proposed 1), the 9 source folds are used for training; in the second variation (proposed 2), all 18 folds are used for training, where the features y_n in the 9 target folds are transformed by W as W y_n. The remaining target fold is used for testing; its features are transformed by W as in training, and the trained classifier is then applied.

Results for these settings are shown in Figure 6. The performance of the baseline is about 90% at the right end of the plot (where the largest number of feature dimensions is used). If the training uses source images only (source only) or both source and target images without transfer (source+target), the performance is far below the baseline. In contrast, the proposed method in both settings (1 and 2) keeps the performance close to the baseline, as we expected.

Another possible experimental setting is target only, in which target images are used for both training and testing. This is only possible when a sufficient amount of target training images (i.e., from a newer endoscope) is available, which is not the case we consider in this paper. In this sense, the proposed 2 setting, which uses the same amount of source and target images, is also impractical even though it can achieve a better result. The proposed 1 setting, which uses source images only for training a classifier, is a promising approach because it needs only the transformation W. Our current research interest is how to estimate W stably with a small set of corresponding source and target images. It is not possible to collect corresponding source and target images of the same polyp, as we have assumed so far; our future research therefore includes devising an efficient "calibration" of W with a corresponding dataset other than actual NBI colon polyps and cancers.
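Putting the pieces together, the proposed 1 protocol amounts to a few lines on top of the earlier sketches. This is our illustrative composition (the variable names and the plain SVC are assumptions), reusing estimate_W from the ADMM sketch in Section III-B.

```python
import numpy as np
from sklearn.svm import SVC

# estimate_W: see the ADMM sketch in Section III-B.
def proposed_1(Xs, Ys, labels, Yt_test):
    """Proposed 1 (sketch). Xs, Ys: K x N matrices of paired source/target
    histograms from the 9 training folds; labels: their class labels;
    Yt_test: histograms of the held-out target fold (K x M)."""
    W = estimate_W(Xs, Ys)               # learn the target-to-source mapping
    clf = SVC().fit(Xs.T, labels)        # train on source histograms only
    return clf.predict((W @ Yt_test).T)  # transform target features, classify
```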

REFERENCES

[1] Cancer Research UK, "Bowel cancer statistics."
[2] S. Tanaka, T. Kaltenbach, K. Chayama, and R. Soetikno, "High-magnification colonoscopy (with videos)," Gastrointest Endosc, vol. 64, pp. 604-613, Oct 2006.
[3] H. Kanao, S. Tanaka, S. Oka, M. Hirata, S. Yoshida, and K. Chayama, "Narrow-band imaging magnification predicts the histology and invasion depth of colorectal tumors," Gastrointest Endosc, vol. 69, pp. 631-636, Mar 2009.
[4] S. Oba, S. Tanaka, S. Oka, H. Kanao, S. Yoshida, F. Shimamoto, and K. Chayama, "Characterization of colorectal tumors using narrow-band imaging magnification: combined diagnosis with both pit pattern and microvessel features," Scand J Gastroenterol, vol. 45, pp. 1084-1092, Sep 2010.
[5] M. Häfner, A. Gangl, M. Liedlgruber, A. Uhl, A. Vécsei, and F. Wrba, "Classification of endoscopic images using Delaunay triangulation-based edge features," in Image Analysis and Recognition (A. Campilho and M. Kamel, eds.), vol. 6112 of Lecture Notes in Computer Science, pp. 131-140, Springer Berlin Heidelberg, 2010.
[6] M. Häfner, A. Gangl, M. Liedlgruber, A. Uhl, A. Vecsei, and F. Wrba, "Endoscopic image classification using edge-based features," in Pattern Recognition (ICPR), 2010 20th International Conference on, pp. 2724-2727, Aug 2010.
[7] R. Kwitt, A. Uhl, M. Häfner, A. Gangl, F. Wrba, and A. Vécsei, "Predicting the histology of colorectal lesions in a probabilistic framework," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pp. 103-110, June 2010.
[8] S. Gross, T. Stehle, A. Behrens, R. Auer, T. Aach, R. Winograd, C. Trautwein, and J. Tischendorf, "A comparison of blood vessel features and local binary patterns for colorectal polyp classification," 2009.
[9] T. Stehle, R. Auer, S. Gross, A. Behrens, J. Wulff, T. Aach, R. Winograd, C. Trautwein, and J. Tischendorf, "Classification of colon polyps in NBI endoscopy using vascularization features," 2009.
[10] J. J. W. Tischendorf, S. Gross, R. Winograd, H. Hecker, R. Auer, A. Behrens, C. Trautwein, T. Aach, and T. Stehle, "Computer-aided classification of colorectal polyps based on vascular patterns: a pilot study," Endoscopy, vol. 42, pp. 203-207, Mar 2010.
[11] T. Tamaki, J. Yoshimuta, M. Kawakami, B. Raytchev, K. Kaneda, S. Yoshida, Y. Takemura, K. Onji, R. Miyaki, and S. Tanaka, "Computer-aided colorectal tumor classification in NBI endoscopy using local features," Medical Image Analysis, vol. 17, no. 1, pp. 78-100, 2013.
[12] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, 2011.
[13] T. Tamaki, J. Yoshimuta, M. Kawakami, B. Raytchev, K. Kaneda, S. Yoshida, Y. Takemura, K. Onji, R. Miyaki, and S. Tanaka, "Computer-aided colorectal tumor classification in NBI endoscopy using local features," Medical Image Analysis, vol. 17, no. 1, pp. 78-100, 2013.
[14] M. T. Rosenstein, Z. Marx, L. P. Kaelbling, and T. G. Dietterich, "To transfer or not to transfer," in NIPS'05 Workshop, Inductive Transfer: 10 Years Later, 2005.
[15] S. J. Pan and Q. Yang, "A survey on transfer learning," Knowledge and Data Engineering, IEEE Transactions on, vol. 22, no. 10, pp. 1345-1359, 2010.
[16] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, "Adapting visual category models to new domains," in Computer Vision-ECCV 2010, pp. 213-226, Springer, 2010.
[17] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in Proceedings of the 24th International Conference on Machine Learning, pp. 209-216, ACM, 2007.
