
Histology image search using multimodal fusion


Juan C. Caicedo a,*,1, Jorge A. Vanegas b, Fabian Paez b, Fabio A. González b

a University of Illinois at Urbana–Champaign, IL, USA
b MindLab Research Laboratory, Universidad Nacional de Colombia, Bogotá, Colombia

a r t i c l e   i n f o

Article history:
Received 18 June 2013
Accepted 30 April 2014
Available online xxxx

Keywords:
Histology
Digital pathology
Image search
Multimodal fusion
Visual representation
Semantic spaces

a b s t r a c t

This work proposes a histology image indexing strategy based on multimodal representations obtained from the combination of visual features and associated semantic annotations. Both data modalities are complementary information sources for an image retrieval system, since visual features lack explicit semantic information and semantic terms do not usually describe the visual appearance of images. The paper proposes a novel strategy to build a fused image representation using matrix factorization algorithms and data reconstruction principles to generate a set of multimodal features. The methodology can seamlessly recover the multimodal representation of images without semantic annotations, allowing us to index new images using visual features only, and also to accept single example images as queries. Experimental evaluations on three different histology image data sets show that our strategy is a simple, yet effective, approach to building multimodal representations for histology image search, and that it outperforms the popular late fusion approach to combining information.

© 2014 Published by Elsevier Inc.

1. Introduction


Digital pathology makes it easy to exchange histology images and enables pathologists to rapidly study multiple samples from different cases without having to unpack the glass slides [1]. The increasing adoption of digital repositories for microscopy images results in large databases with thousands of records, which may be useful for supporting the decision-making process in clinical and research activities. However, in modern hospitals and health care centers, the number of images to keep track of is beyond the ability of any specialist. A very promising direction to realize the potential of these collections is through efficient and effective tools for image search. For instance, when a new slide is being observed, a camera coupled to the microscope can capture the current view, send the picture to the retrieval system, and show results on a connected computer. These results can help to clarify structures in the observed image, explore previous cases and, in general, may allow clinicians and researchers to explore large collections of records previously evaluated and diagnosed by other physicians.

* Corresponding author. Current address: Computer Science Department, University of Illinois at Urbana-Champaign, 201 N Goodwin Ave, Urbana, IL 61801, USA.
E-mail addresses: [email protected] (J.C. Caicedo), [email protected] (J.A. Vanegas), [email protected] (F. Paez), [email protected] (F.A. González).
1 Work done while at Universidad Nacional de Colombia.

The query-by-example paradigm for image search—when the user's query is an example image with no annotations—has a number of potential applications in medicine and clinical activities [2]. The main challenge when implementing such a system is to correctly define the matching criteria between query images and database images. The standard approach for content-based retrieval in image collections relies on similarity measures between low-level visual features to perform a nearest-neighbor search [3]. The problem with this approach is that these characteristics usually fail to capture the high-level semantics of images, a problem known as the semantic gap [4]. Different methods to bridge this gap have been proposed to build a model that connects low-level features with high-level semantic content, such as automatic image annotation [5] and query by semantic example [6]. These methods represent images in a semantic space spanned by keywords, so a nearest-neighbor search in that space retrieves semantically related images. Approaches like these have also been investigated for histology image search [7–9]. Image search systems based on a semantic representation have been shown to outperform purely visual search systems in terms of Mean Average Precision (MAP) [3]. However, these approaches may lose the notion of visual similarity among images, since the search process ends up relying entirely on high-level descriptions of images. The ranking of search results is based on potentially relevant keywords, ignoring useful appearance clues that are not described by index terms. In a clinical setting, visual information plays an important role for searching histology images, which


ultimately reveals the biological evidence for the decision-making process in clinical activities. We consider that both visual content and semantic data are complementary sources of information that may be combined to produce high quality search results.

Multimodal fusion has emerged as a very useful approach to combine different signals with the purpose of making certain semantic decisions in automated systems. We refer the reader to [10] for a comprehensive survey of multimodal fusion in various multimedia applications. For image indexing in particular, multimodal fusion consists of combining visual and semantic data. Several methodologies have recently been proposed to model the relationships between these two data modalities, with the goal of constructing better image search systems. Two main strategies may be identified to achieve the combination of both data modalities: (1) early fusion [11], which builds a combined representation of images before the ranking procedure, and (2) late fusion [12], which combines similarity measures during the ranking procedure. One of the advantages of early fusion over late fusion is that the former often benefits from explicitly modeling the relationships between the two data modalities, instead of simply using them as separate opinions. However, this clearly requires a significant effort in understanding and extracting multimodal correspondences.

In this work, we propose a novel method for indexing histology images using an early multimodal fusion approach, that is, combining the two data modalities in a single representation and generating the ranking directly in that space. The proposed methods use semantic annotations as an additional data source that represents images in a vector space model. Then, matrix-factorization-based algorithms are used to find the relationships between data modalities by learning a function that projects visual data to the semantic space and vice versa. We take advantage of this property by fusing both data modalities in the same vector space, obtaining as a result the combined representation of images.

A systematic experimental evaluation was conducted on three different histology image databases. Our goal is to validate the potential of various image search techniques and to understand the strengths and weaknesses of visual, semantic and multimodal indexing in histology image collections. We focus our evaluation on two performance measures commonly used in information retrieval research: Mean Average Precision (MAP) and precision at the first 10 results of the ranked list (P@10), i.e., early precision. We observed that semantic approaches are very good at maximizing MAP, while visual search is a strong baseline for P@10, revealing a trade-off in performance when using one or the other representation. This also confirms the importance of combining both data modalities.

Our approach combines multimodal data using a convex combination of the visual and semantic information, resulting in a continuous spectrum of multimodal representations and allowing us to explore various mixes, from purely visual to purely semantic representations, as needed. This is similar in spirit to late fusion, which allows setting weights for the scores produced by each modality. However, our study shows a significant improvement in performance when building an explicitly fused representation, instead of considering the modalities as separate voters for the rank of images. We also found that multimodal fusion can balance the trade-off between maximizing MAP and early precision, demonstrating the potential to improve the response of histology image retrieval systems.

1.1. Overview


This work proposes an indexing technique for image search, using both visual image content and associated semantic terms. Fig. 1 illustrates a pipeline for image search in a clinical setting, which involves a physician or expert pathologist working with microscopy equipment with digital image acquisition capabilities or in a virtual microscopy system. Through an interactive mechanism, the user can ask the system to take a picture of the current view and send a query to the image search system. The system has a pre-computed fused representation of the images in the database. A ranking algorithm is used to identify the most relevant results in the database, which are retrieved and presented to the user.

The main goal of the system is to support clinicians during the decision-making process by providing relevant associated information. The ability to find related cases among past records in a database has the potential to improve the quality of health care using an evidence-based reasoning approach. Historic archives in a hospital comprise a knowledge base reflecting its institutional experience and expertise, and they can be used to enhance daily medical practice.

This paper focuses on two important aspects of the entire pipeline: (1) strategies for constructing the index based on a multimodal fused representation and (2) an empirical evaluation of different strategies for histology image search using collections of real diagnostic images. The main contribution of our work is a novel method for combining visual and semantic data in a fused image representation, using a computationally efficient strategy that outperforms the popular late-fusion approach and that balances the trade-off between visual and semantic data. While the applicability of the proposed model may extend to general image collections beyond histology images, the second contribution of this work is an extensive evaluation on histology images, since a straightforward application of image retrieval techniques may not result in an optimal outcome. Part of our experimental evaluation shows that off-the-shelf indexing methods such as latent semantic indexing and late fusion do not always exploit specific characteristics of histology images.

Fig. 1. Overview of the image search pipeline. Images acquired in a clinical setting are used as example queries. The system processes and matches queries with entries in a multimodal index, which represent images and text in the database. Results are returned to support the decision making process.


Another question related to the use of such a system in a real environment is the impact that search results would have on medical practice itself and, in turn, on the quality of health care for patients. We believe this question is both very interesting and quite important to investigate. To the best of our knowledge, a formal study of this problem has not yet been conducted in the histology domain, and it is also beyond the scope of this paper. However, other studies in radiology have shown improved decisions in the final diagnosis made by inexperienced physicians when they use image retrieval technologies [13].

The contents of this paper are organized as follows: Section 2 discusses relevant related work in histology image retrieval. The three histology image data sets used for experimental evaluations are presented in Section 3. Section 4 introduces the proposed algorithms and methods for multimodal fusion. The experimental evaluation and results are presented in Section 5. Finally, Section 6 summarizes and presents the concluding remarks.

2. Previous work


The automatic analysis of histology images is an important and growing research field that comprises different purposes and techniques. From image classification [14] to automatic pathology grading [15], the large amount of microscopy images in medicine may benefit from automated methods that allow users to manage visual collections to support the decision-making process in clinical practice. This work is primarily focused on image search and retrieval technologies, which serve as a mechanism to find relevant and useful histology images in an available database.

2.1. Content-based medical image retrieval


Early studies of content-based medical image retrieval were reviewed by Müller et al. [2]. One of the first systems for histology image retrieval was reported by Zheng et al. [16], which uses low-level visual features to discriminate between various pathological samples. The use of low-level features was quickly recognized to have limitations for distinguishing among complex semantic arrangements in histology images, so researchers proposed the semantic analysis of histology slides to build image search systems. Later, a system based on artificial neural networks that learned to recognize twenty concepts of gastro-intestinal tissues on digital slides was presented by Tang et al. [7]. The system allowed clinicians to query using specific regions of example histology images. However, important efforts to collect labeled examples were required, since the design of these learning algorithms needed local annotations, a procedure that can be very expensive. Relatively few works continued the effort of designing semantic image search systems for histology images; these include the work of Naik et al. [8] on breast tissue samples and Caicedo et al. [17] on skin cancer slides. However, the task of histology image classification has been actively explored in various microscopy domains [18–22], which is related to semantic retrieval. The primary purpose of these methods is to assign correct labels to images, which differs from the problem of building a multimodal representation. Besides, the transformation from visual content to strict semantic keywords may lead to a loss of visual information that is useful for a search engine, since images are summarized in a few keywords and visual details are not considered anymore.

2.2. Multimodal fusion for medical image retrieval


Multimodal retrieval has been approached in the medical imaging domain to find useful images in academic journal repositories and biomedical image collections by combining captions along

with visual characteristics to find relevant results using late fusion [23,24]. However, these strategies assume that the user's query is composed of example images as well as a text description. If users only provide example images, because they do not know precise terms or for other practical reasons, the system has no choice other than to match purely visual content. Relevant work by Rahman et al. [25] uses fusion techniques for biomedical image retrieval by combining multiple features and classifier scores on the fly, allowing users to interact and provide relevance feedback to refine results. For ten years, the ImageCLEFmed community also dedicated efforts to study the problem of medical image retrieval using multimodal data around an academic challenge [26–29]. Each year, various research groups obtained a copy of a single collection of medical images that includes almost all medical imaging techniques at once, such as X-rays, PET, MRI, CT and microscopy. The goal was to index its contents and provide answers for specific queries composed of example images and text. Multimodal fusion was a central idea in the development of solutions in this challenge, and late fusion has been reported as one of the most robust techniques.

Our work is focused on combining semantic terms and visual features in a fused image representation (based on early fusion principles), which can be used for image retrieval or as input for other classifiers and systems. An important component of the proposed strategy is its ability to use the same representation for images that do not have text annotations. In that way, the system can handle example image queries as well as database images without semantic meta-data. In this work, we build on top of Nonnegative Matrix Factorization algorithms recently proposed to find relationships in multimodal image collections [30]. We extend these ideas to propose a novel algorithm for fusing multimodal information in histology image databases with an arbitrary number of terms.

Other studies of multimodal fusion for multimedia retrieval have been conducted recently. A strategy to summarize and browse large collections of Flickr images using text and visual information was presented by Fan et al. [31]. However, they learn latent topics for text independently of visual features, so multimodal relationships are not modeled or extracted. The use of latent topic models has been extended to explicitly find relationships between visual features and text terms using probabilistic latent semantic indexing [32] and very rich graphical models [33]. In this work, we formulate the problem of extracting multimodal relationships as a subspace learning problem, which generates multimodal representations using vector operations. Recent studies of multimodal fusion for histology images deal with the problems of combining different imaging modalities (such as MRI images and microscopy images) [34] or combining decisions made at different regions of the same image [35]. To our knowledge, our work is the first study of multimodal fusion of semantic and visual data specifically oriented to histology image retrieval.

We reported promising experimental results in our previous work [36], and this paper extends that evaluation in substantial ways. First, the notion of multimodal fusion by back-projection is introduced for the first time, which allows us to effectively combine visual and semantic representations for histology image indexing. Second, a more comprehensive experimentation was carried out, using three different data sets, additional evaluations and extended discussions.

3. Histology image collections


Three different histology image collections were selected as case studies for this work. The first two are from pathology cases with



Table 1
Number of images and semantic terms on each data set.

Data set                 Images    Training    Query    Terms
Cervical Cancer             530         447       54        8
Basal-cell Carcinoma       1502        1201      301       18
Histology Atlas            2641        2113      528       46

corresponding diagnosis and descriptions. They were collected as part of different long-term projects in the Pathology and the Biology departments, with the collaboration of several experts and graduate students from the Medicine School of Universidad Nacional de Colombia, in Bogotá. The data sets were collected and annotated by expert pathologists, extracting information from a larger database of real cases. In general, these collections of images have been annotated by several individuals, who agree on the results after discussions in a committee-like process. The efforts of collecting these annotations have been oriented to creating index terms that preserve information related to cases, which could be accessible through an information retrieval system. These cases were anonymized to remove any information related to patients, and only data associated with the diagnosis and description of images were preserved. The third data set is part of a histology atlas containing images from the four fundamental tissues of living beings. These images were collected and labeled by researchers in the Biology Department, to provide students with high quality reference material in digital format. Table 1 presents basic statistics of these collections and Fig. 2 shows some example images for each data set. More details about each data set are presented below.

1. Cervical Cancer. This data set, with 530 images from more than 120 cases, characterizes various conditions and stages of cervical cancer. Images in this collection were acquired by a medical resident and validated by an expert pathologist from tissue samples stained with hematoxylin and eosin. Images were captured at 40x magnification with controlled lighting conditions, and from each slide, an average of 4.5 sub-regions of 3840 x 3072 pixels were selected. Each image preserves as metadata the case number to which it belongs and a list of global annotations. Annotations span eight different categories including relevant diagnostic information and other tissue characteristics. This list of categories includes cervicitis inflammatory pathology, intraepithelial lesion, squamous cell carcinoma, and metaplasia, among others.

2. Basal-cell Carcinoma. This collection has 1502 images of skin samples stained with hematoxylin and eosin used to diagnose cancer, from a collection of more than 300 cases. About 900 of these images correspond to pathological cases, while the remaining 600 are from normal tissue samples, which allows physicians to contrast differences between both conditions. This makes a difference with respect to the previous data set, which only has pathological cases. This data set contains images acquired at different magnification levels, including 8x, 10x and 20x, and stored at 1280 x 1024 pixels in JPG format. Global annotations were assigned by a pathologist to highlight various tissue structures and relevant diagnostic information using a list of eighteen different terms. This collection has been used in previous histology image retrieval work [37,9].

3. Histology Atlas. This is the largest data set used in our study, with 2641 images illustrating biological structures of the four fundamental tissues in biology: connective, epithelial, muscular and nervous. Images of these tissues come from different organs of several young-adult mice, where samples were stained using hematoxylin and eosin, and immunohistochemical techniques. These images are in different resolutions and magnification factors, and are organized with hierarchical annotations generated by pathologists and biologists, indicating the observed biological system and organs, giving a total of 46 different indexing terms. The resulting annotations include terms like circulatory system, heart, lymphatic system and thymus, among others. This data set has also been used in previous work at our lab [38] and is the only one currently available online free of charge.2

Notice that in all cases, a single image can have several semantic terms associated with it. This is an important characteristic of real-world databases, which do not split collections into disjoint sets, but rather allow images to have multiple annotations describing several aspects or objects within the image. Images are usually regions of interest containing patterns observed in full tissue slides, which were focused and selected by the team of pathologists and biologists. They rarely included views of a complete tissue slide. The resulting image collections present variations in magnification and acquisition style, which are considered natural properties of large-scale, real-world medical image collections. The acquisition process was not restricted to highly controlled or rigid image views, but instead encouraged spontaneous variability motivated by domain-specific interestingness, which may make the search process more challenging.

3.1. Data representation


3.1.1. Visual representation
A large variety of methods have been investigated to extract and represent visual characteristics in histology images. Be it for automated grading [15], classification [20] or image retrieval [7], two important features are usually modeled: color and texture. Color features exploit useful information associated with staining levels, which are natural bio-markers for pathologists. Texture features exploit regularities in biological structures, since tissues tend to follow homogeneous patterns. In this work, a bag-of-features representation is used, which has been shown to be a useful representation of histology images due to its ability to adaptively encode distributions of textures in an image collection. We selected the Discrete Cosine Transform (DCT), computed at each RGB color channel, as the local feature descriptor. A dictionary of 500 codeblocks is constructed using the k-means algorithm for each image collection separately. Then, a histogram of the distribution of these codeblocks is computed for each image. As a result, we have vectors in R^n, with n = 500 visual features for each image. When appropriately trained, the dictionary is able to encode meaningful patterns that correlate with high-level concepts such as the size and density of nuclei, which may allow the system to distinguish important characteristics such as magnification factor and tissue type. We refer the reader to the work of Cruz-Roa et al. [38] for more details about this histology image representation approach, which we followed closely in this work.
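To make the pipeline concrete, the listing below sketches a bag-of-features extractor of this kind in Python. It is not the authors' implementation: the 8 x 8 grid patches, the MiniBatchKMeans clustering and the helper names are assumptions, while the per-channel DCT descriptor and the 500-codeblock dictionary follow the description above.

import numpy as np
from scipy.fft import dctn
from sklearn.cluster import MiniBatchKMeans

PATCH, DICT_SIZE = 8, 500  # patch size (assumed) and codebook size (as described)

def local_descriptors(image):
    # Concatenate 2-D DCT coefficients of each RGB channel for every grid patch.
    h, w, _ = image.shape
    feats = []
    for y in range(0, h - PATCH + 1, PATCH):
        for x in range(0, w - PATCH + 1, PATCH):
            patch = image[y:y + PATCH, x:x + PATCH, :]
            feats.append(np.concatenate([dctn(patch[:, :, c], norm='ortho').ravel()
                                         for c in range(3)]))
    return np.array(feats)

def build_codebook(training_images):
    # k-means dictionary of DICT_SIZE codeblocks, learned per collection.
    all_feats = np.vstack([local_descriptors(im) for im in training_images])
    return MiniBatchKMeans(n_clusters=DICT_SIZE, random_state=0).fit(all_feats)

def bof_histogram(image, codebook):
    # l1-normalized histogram of codeblock occurrences: the vector x_V in R^500.
    words = codebook.predict(local_descriptors(image))
    hist = np.bincount(words, minlength=DICT_SIZE).astype(float)
    return hist / (hist.sum() + 1e-9)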

3.1.2. Semantic representation
Likewise, semantic data is herein represented as a bag of words following a vector space strategy commonly used in information retrieval [39]. First, the dictionary of indexing terms is constructed using the list of available keywords. Then, assuming a dictionary with m terms, each image is represented as a binary vector in R^m, in which each dimension indicates whether the corresponding semantic term is assigned to the image. Using this representation, each image can have as many semantic terms assigned as needed. Also, the size of the semantic dictionary is not limited and can be easily extended.

2 Dataset of 20,000 histology images at http://www.informed.unal.edu.co; dataset of tissue types at http://168.176.61.90/histologyDS/.


Fig. 2. Sample images and annotations from the three histology data sets used in this work: (a) Cervical Cancer data set. (b) Basal Cell Carcinoma data set. (c) Histology Atlas data set. These sample images have been selected to illustrate the kind of contents and annotations available in each collection.


The number of different terms for each data set can be found in the last column of Table 1. None of the images in the three data sets has all annotations at the same time. Usually, a single image has between two and four semantic annotations assigned to it, depending on the data set. These keywords may co-occur in some cases, and they can also exclude each other in others. We do not exploit these term relationships explicitly, since the bag-of-words representation has been adopted.

Notice that we use the term semantics to refer to semantic terms only. In this work, the relationships among terms are not explicitly considered through the use of ontologies or similar data structures. The use of semantics throughout the paper is intended to emphasize our goal of assigning high-level interpretations to low-level visual signals, which are not easily understood by computers in the same way as humans do. Smeulders et al. [4] named this condition the semantic gap, and many other studies thereafter have adopted similar uses of the term to refer to this problem [40].

Since both visual and semantic representations are vectors, a database of images can be represented with two matrices by stacking the corresponding vectors of visual and semantic features as columns. The notation used in the following sections sets the matrix of visual data for a collection of l images as X_V ∈ R^{n×l}, where n is the number of visual patterns in the bag-of-features representation. The matrix of semantic terms for the same collection is X_S ∈ R^{m×l}, where m is the number of keywords in the semantic dictionary.
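As an illustration of this notation, the sketch below assembles X_V and X_S for a collection, reusing bof_histogram from the previous listing; the MultiLabelBinarizer choice and the function names are assumptions rather than the authors' code.

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

def build_matrices(images, annotations, codebook, term_dictionary):
    # images: list of RGB arrays; annotations: list of keyword lists per image.
    # Visual matrix X_V (n x l): l1-normalized bag-of-features histograms as columns.
    X_V = np.column_stack([bof_histogram(im, codebook) for im in images])
    # Semantic matrix X_S (m x l): binary term-occurrence vectors as columns.
    mlb = MultiLabelBinarizer(classes=term_dictionary)
    X_S = mlb.fit_transform(annotations).T.astype(float)
    return X_V, X_S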

4. Multimodal fusion


The search method proposed in this work is based on a multimodal representation of images that combines visual features with semantic information. Fig. 3 presents an overview of the proposed approach, which is comprised of three sequential stages: (1) visual indexing, (2) semantic embedding and (3) multimodal fusion. Three image representations are obtained throughout the process: (1) visual features, (2) semantic features and (3) the proposed fused representation. The retrieval engine can be set up to search using any of the three representations. In the following subsections, we assume a visual and semantic data representation following the description of Section 3.1, and focus on describing components (2) and (3) of Fig. 3.

4.1. Semantic embedding


The goal of a semantic embedding is to learn the relationships between visual features and semantic terms in order to generate a new representation of images based on high-level concepts. The strategy proposed in this work is based on a matrix factorization algorithm, which allows the system to learn these relationships as a linear projection between the visual and semantic spaces. In this work, we adopt the notions of multimodal image indexing using Nonnegative Matrix Factorization (NMF) recently proposed in [30], and extend these ideas by introducing a direct semantic embedding. In the following sections, two strategies for modeling visual-to-semantic relationships are presented.

4.1.1. Latent semantic embedding
The first algorithm for semantic embedding is based on NMF, which allows us to extract structural information from a collection of data samples. For any input matrix X ∈ R^{n×l}, containing l data samples with n nonnegative features in its column vectors, NMF finds a low-rank approximation of the data using non-negativity constraints:

X \approx W H, \qquad W, H \geq 0

where W ∈ R^{n×r} is the basis^3 of the vector space in which the data will be represented and H ∈ R^{r×l} is the new data representation using r factors. Both W and H are unknowns in this problem, and the decomposition can be found using alternating optimization. We use the divergence criterion to find an approximate solution with the multiplicative updating rules proposed by Lee and Seung [41]. Given the matrix of visual features, X_V, and the matrix of semantic terms, X_S, we aim to find correlations between both. We model these relationships using a common latent space, onto which both data modalities have to be projected. We employ the following two-stage approach to find semantic latent factors for image features and keywords:

3 The term "basis" is slightly abusive here, since the vectors in the matrix W are not necessarily linearly independent and the set of vectors may be redundant.



Fig. 3. Overview of the proposed fused representation. From the input image to the final fused representation, three main processes are carried out: visual indexing, semantic embedding and multimodal fusion. Each stage also produces the corresponding representation: only visual, only semantic and fused.


1. Decompose X_S: The matrix of semantic terms is first decomposed using NMF to find a low-rank approximation of the semantic data. This step may be understood as finding correlations between various terms according to their joint occurrence on images. The result of this first step is a decomposition of the form X_S = W_S H, with the matrix W_S as the semantic projection and H as the latent semantic representation.
2. Embed X_V: Find a projection function for visual features to embed the data in the latent factor space. This is achieved by solving the equation X_V = W_V H for W_V only, while fixing the latent representation, H, equal to the matrix found in the previous stage.

The correlations between visual features and semantic data are encoded through the matrices W_V, W_S and H. The new latent representation is semantic by design, since the function to project visual features spans a latent space that has been originally formed by semantic data. We refer to this algorithm as NMF Asymmetric (NMFA) in our later discussions.
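A minimal sketch of this two-stage procedure is shown below, assuming dense matrices X_S (m x l) and X_V (n x l) as defined in Section 3.1. Using scikit-learn's NMF for the first stage and a fixed number of multiplicative updates for the second stage are implementation assumptions, not the authors' exact code.

import numpy as np
from sklearn.decomposition import NMF

def nmfa_fit(X_S, X_V, r=15, n_iter=200, eps=1e-9):
    # Stage 1: X_S ~ W_S H, latent factors learned from term co-occurrences.
    model = NMF(n_components=r, beta_loss='kullback-leibler',
                solver='mu', init='nndsvda', max_iter=500)
    W_S = model.fit_transform(X_S)   # (m x r) semantic projection
    H = model.components_            # (r x l) latent representation
    # Stage 2: solve X_V ~ W_V H for W_V with H fixed, using Lee-Seung style
    # multiplicative updates for the divergence criterion.
    n = X_V.shape[0]
    W_V = np.abs(np.random.rand(n, r))
    for _ in range(n_iter):
        WH = W_V @ H + eps
        W_V *= ((X_V / WH) @ H.T) / (H.sum(axis=1) + eps)
    return W_S, W_V, H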

4.1.2. Direct semantic embedding
An alternative approach to model the relationships between visual features and semantic terms is to find a transformation function between both spaces directly. This problem is formulated as follows:

X_V \approx W X_S, \qquad W \geq 0 \qquad (1)

where W ∈ R^{n×m} is a matrix that approximately embeds visual features in the space of semantic terms. Instead of extracting a latent factor structure from the data, this strategy fixes the latent encoding (the matrix H) as the known semantic representation of images in the collection, X_S. This can be understood as requiring the latent factors to match exactly the semantic representation of images, resulting in a scheme for learning the structure of visual features that directly correlates with keywords. To solve this problem, the divergence between the matrix of visual features and the embedded data is adopted as the objective function:

D(X_V \,\|\, W X_S) = \sum_{ij} \left( (X_V)_{ij} \log \frac{(X_V)_{ij}}{(W X_S)_{ij}} - (X_V)_{ij} + (W X_S)_{ij} \right) \qquad (2)

The goal is to minimize this divergence measure on a set of training images, considering that this optimization problem is convex and can be solved efficiently following gradient descent or interior point strategies. In this work, the matrix W is learned using the following multiplicative updating rule:

W_{ij} = W_{ij} \frac{\sum_{u} (X_S)_{ju} (X_V)_{iu} / (W X_S)_{iu}}{\sum_{v} (X_S)_{jv}} \qquad (3)

This is a rescaled gradient descent approach that uses a data-dependent step size, following Lee and Seung's methodology for NMF [41]. The solution is found by iteratively running this updating rule for W for a number of iterations or until a fixed error reduction is reached. We refer to this algorithm as the Nonnegative Semantic Embedding (NSE) in our later discussions.

4.1.3. Projecting unlabeled images
To recover the semantic representation of an image without keywords, we need to solve the following equation for x_S:

x_V \approx W x_S, \qquad x_S \geq 0 \qquad (4)

where x_V is the observed vector of visual features and W is a learned semantic embedding function. This formulation is also compatible with the latent semantic embedding, assuming W = W_V and x_S = h. In both situations, the non-negativity constraint holds and the same procedure is followed. Thus, the problem in Eq. (4) can be formulated as minimizing the divergence between the observed visual data and its reconstruction from the semantic space. Regarding the non-negativity restriction, the solution can be efficiently approximated by iterating the following multiplicative updating rule:

(x_S)_{a} = (x_S)_{a} \frac{\sum_{i} W_{ia} (x_V)_{i} / (W x_S)_{i}}{\sum_{k} W_{ka}} \qquad (5)

Following this procedure, a semantic representation of new images is constructed.
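The listing below sketches rules (3) and (5) under the same assumptions as the previous listings (dense numpy arrays, a fixed iteration budget instead of a convergence test).

import numpy as np

EPS = 1e-9

def nse_fit(X_V, X_S, n_iter=200):
    # Learn W such that X_V ~ W X_S using the multiplicative rule (3).
    n, m = X_V.shape[0], X_S.shape[0]
    W = np.abs(np.random.rand(n, m))
    for _ in range(n_iter):
        WX = W @ X_S + EPS
        W *= ((X_V / WX) @ X_S.T) / (X_S.sum(axis=1) + EPS)
    return W

def project_unlabeled(x_V, W, n_iter=200):
    # Recover a semantic representation x_S for an image without keywords
    # by iterating the multiplicative rule (5).
    x_S = np.abs(np.random.rand(W.shape[1]))
    for _ in range(n_iter):
        Wx = W @ x_S + EPS
        x_S *= (W.T @ (x_V / Wx)) / (W.sum(axis=0) + EPS)
    return x_S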

4.2. Fusing visual and semantic content


So far, we have considered the visual and semantic strategies for image search. The first strategy is entirely based on visual features, to match visually similar images. The second strategy is based on an inferred semantic representation, to match semantically related images. In this section we introduce a third strategy, based on multimodal fusion. The main goal of this scheme is to combine visual features and semantic data in the same image representation, to exploit the best properties of each modality.

4.2.1. Fusion by back-projection
The proposed fusion strategy projects semantic data back to the visual feature space to make a convex combination of both visual and semantic representations, as illustrated in Fig. 4. This can be understood as an early fusion strategy, since the representations are merged before their subsequent use. Assuming a histogram of visual features x_V and a vector with a predicted semantic representation x_S, the fusion procedure generates a new image representation defined as:


Fig. 4. Illustration of the fusion by back-projection procedure. The first step represents a query image in the visual feature space. Second, a semantic embedding is done to approximate high-level concepts for that image. Third, the semantic representation is projected back to the visual space and combined with the original visual representation.


x_f := \lambda x_V + (1 - \lambda) W x_S \qquad (6)

where x_f ∈ R^n is the vector of fused features in the visual space and λ is the parameter of the convex combination that controls the relative importance of the data modalities. This fusion approach takes the semantic representation of images and projects it back to the visual space using the reconstruction formula:

\hat{x}_V := W x_S \qquad (7)

This back-projection is a linear combination of the column vectors in W using the semantic annotations as weights. In that way, the reconstructed vector \hat{x}_V represents the set of visual features that an image should have according to the learned semantic relationships in the image collection. Therefore, \hat{x}_V and x_V highlight different visual structures of the same image, since \hat{x}_V is a semantic approximation of the observed visual features, according to Eqs. (4) and (7). Notice that this extension can be applied to a latent semantic embedding using NMFA or to a direct semantic embedding using NSE. We refer to this extension using the suffix BP for Back-Projection (NMFA-BP or NSE-BP).
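The following sketch combines Eqs. (5)–(7), reusing project_unlabeled from the previous listing; the value of the weight shown is only a placeholder, since the paper tunes separate weights α and β for database and query images (Section 4.2.2).

import numpy as np

def fuse_back_projection(x_V, W, lam=0.5, n_iter=200):
    # Fused representation x_f = lam * x_V + (1 - lam) * W x_S  (Eqs. 6 and 7).
    x_S = project_unlabeled(x_V, W, n_iter=n_iter)  # semantic embedding, Eq. (5)
    x_hat = W @ x_S                                  # back-projection, Eq. (7)
    x_f = lam * x_V + (1.0 - lam) * x_hat            # convex combination, Eq. (6)
    return x_f / (x_f.sum() + 1e-9)                  # l1-normalize for retrieval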

4.2.2. Controlling modality importance
The parameter λ in the convex combination of the fusion strategy (Eq. (6)) allows us to control the importance of each data modality. The problem of assigning more weight to one or the other modality mainly depends on the performance that each modality offers to solve queries. More specifically, it depends on how faithfully one modality represents the true content of an image. On the one hand, visual features may be inaccurate for representing high-level semantic concepts, but good at representing low-level visual arrangements. On the other hand, the semantic representation may be noisy or incomplete because of human errors or prediction discrepancies.

The parameter λ is split into two different parameters to consider two kinds of images: database images and query images. For both kinds of images, the semantic representations are predicted by the learned model. For database images, the parameter λ will be called α, and for query images it will be called β, throughout the paper. This distinction allows us to control the importance of the semantic modality for new, unseen query images, taking into account that predicting an approximate semantic representation may involve some inference noise. We evaluate the influence of these parameters in the following sections.

4.3. Histology image search


Indexing images by visual content in the retrieval system means that all searchable images in the collection, as well as all query images, are represented using the bag-of-features histogram, which is a non-parametric probability distribution of visual patterns. The latent semantic representation and the direct semantic representation are both nonnegative vectors that can be properly normalized, making their l1-norm equal to 1, so the new values are interpreted as the probabilities of high-level semantic concepts for one image. In the case of the fused representation, features are once again represented in the visual feature space, and the l1-normalization is applied as well.

The retrieval system requires a similarity function to rank images in the collection by comparing them with the features of the query. The three representations discussed above can be considered probability distributions, and the most natural way to compare them is using a similarity measure appropriate for probability distributions. The histogram intersection is a measure for estimating the commonalities between two non-parametric probability distributions represented by histograms. It computes the common area between both histograms, obtaining a maximum value when both histograms are the same distribution and zero when there is nothing in common. The histogram intersection is defined as follows:

k_{\cap}(x, y) = \sum_{i=1}^{n} \min\{x_i, y_i\} \qquad (8)

where x and y are histograms and the subindex i represents the i-th bin in each histogram, of a total of n. This similarity measure has been shown to be a valid kernel function for machine learning applications [42], and has been successfully used in different computer vision tasks [43]. We adopt this similarity measure for image search in the three representation spaces evaluated in this work: the visual space, the semantic space and the fused space.
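For illustration, a sketch of the histogram intersection and of ranking an index with it; the vectorized form and variable names are assumptions rather than the authors' code.

import numpy as np

def histogram_intersection(x, y):
    # Eq. (8): common area between two l1-normalized histograms.
    return np.minimum(x, y).sum()

def rank_database(query, index):
    # index: (d x l) matrix whose columns are indexed image representations.
    # Returns database positions sorted from most to least similar to the query.
    scores = np.minimum(index, query[:, None]).sum(axis=0)
    return np.argsort(-scores)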

5. Experiments and results


Experiments to evaluate retrieval performance were conducted on the three histology data sets described in Section 3. In our experimental evaluation, we focus on image search experiments under the query-by-example paradigm, in other words, the use of non-annotated query images to retrieve relevant database images. Our goal is to demonstrate the strengths and weaknesses




of each of the three image representations: visual, semantic and multimodal.


5.1. Experimental protocol


5.1.1. Training, validation and test
We follow a training, validation and test experimental scheme to set up and evaluate the algorithms. A sample with 20% of the images in each collection is separated as held-out data for testing experiments. The other 80% of the images are used for training and validation experiments. This partition is made using stratified sampling over the distribution of semantic terms, to separate representative samples of the semantic concepts in the data set. The number of images in each data set and the number of images in the corresponding partitions are reported in Table 1. For training and validation, a 10-fold cross-validation procedure was employed, and test experiments are conducted on the held-out data.

5.1.2. Performance measures
A single experiment in a particular data set consists of a simulated query, i.e., an example image taken from the test or validation sets, for which semantic annotations are known but hidden. Then, the ranking algorithm is run over all database images and the list of ranked results is evaluated. The evaluation criterion adopted in our experiments is based on the assumption that a result is relevant to the query if both share one or more semantic terms. This assumption is reasonable under the query-by-example paradigm evaluated in this work. Since the system does not receive an explicit set of keywords, the query is highly ambiguous and the intention of the user may be implicit. It is even possible that the user is not completely aware of exactly what she is looking for. Therefore, the system does a good job if it retrieves images that are related to the query in at least one possible sense, helping the user to better understand image contents and supporting the decision-making process.

We performed automated experiments by sending a query to the system and evaluating the relevance of the results. The quality of the result list is evaluated using information retrieval measures, mainly Mean Average Precision (MAP) and precision at the first ten results (P@10, or early precision) [39]. For computing these measurements, we used the trec_eval tool, available online.4 These two measurements are complementary views of the performance of a retrieval system, and we found in our experiments that there is a trade-off when trying to maximize both at the same time, as discussed below.
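For reference, a small sketch of how average precision and P@10 can be computed for a single ranked list with binary relevance; the reported numbers in the paper come from trec_eval, not from this code.

import numpy as np

def average_precision(relevance):
    # relevance: binary array ordered by rank (1 = relevant result).
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(relevance) / (np.arange(len(relevance)) + 1)
    return float((precision_at_k * relevance).sum() / relevance.sum())

def precision_at_10(relevance):
    return float(np.mean(np.asarray(relevance, dtype=float)[:10]))

# MAP is the mean of average_precision over all simulated queries.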

5.2. Baselines


The first natural baseline for image search under the query-by-example paradigm is the performance of purely visual search, that is, when no effort is made to introduce semantic information into the search process. In this case, the histogram intersection similarity measure is used directly to match query features to similar visual content from the database. This baseline allows us to observe performance gains when using semantic information with a particular indexing algorithm.

We also consider late fusion as a second baseline, since it can combine data from two different data sources. Late fusion is a very popular strategy for combining multiple similarity measures in a retrieval system thanks to its simplicity. In particular, a simple score combination has been shown to be robust both in theory [44] and in practice [45,46], performing better than other schemes such as rank combination, minimum, maximum, and other operators. We adopt

4 http://trec.nist.gov/trec_eval/.

the score combination strategy using a convex combination of visual and semantic similarities to produce a single score for each image with respect to the query. Both similarities are first normalized using a min–max procedure. In addition, we optimize the parameter of the convex combination to produce the best baseline possible. In the experiments reported in this work, the semantic information used during late fusion is not provided by the user: it is automatically generated by the proposed methods. For any query image, we predict a semantic representation using the latent (NMFA) or direct (NSE) embeddings, and compute similarity scores with respect to database images. Also, similarity scores are computed using visual features independently. Then, both visual and semantic scores are combined to include the opinion of both views. Notice that this is fundamentally different from the proposed fusion by back-projection, since computing similarity scores in each space separately does not require any learning or modeling of multimodal relationships.
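A sketch of this late fusion baseline, assuming the visual and semantic score vectors over the database have already been computed (e.g., with the histogram intersection above); the weight w and the min–max helper are illustrative.

import numpy as np

def min_max(scores):
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-9)

def late_fusion_rank(visual_scores, semantic_scores, w=0.5):
    # Convex combination of normalized scores; higher combined score is better.
    combined = w * min_max(visual_scores) + (1.0 - w) * min_max(semantic_scores)
    return np.argsort(-combined)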

5.3. Trade-offs in retrieval performance


The first set of experiments focuses on performance evaluation for the semantic and multimodal strategies. Results reported in this Section correspond to experiments conducted on the training and validation sets, following a 10-fold cross-validation procedure. In the following subsections we describe our findings on the trade-off between MAP and P@10 when using semantic or visual search, and show how our algorithms can help to find a convenient balance.

5.3.1. Visual and semantic search
The NMFA and NSE algorithms described in Sections 4.1.1 and 4.1.2, respectively, are used to project images to a semantic space spanned by terms. One of the main advantages of NSE is that it does not need any parameter tuning during learning or inference, so it can be directly deployed in a retrieval system on top of the visual representation without further effort. For NMFA, the number of latent factors has to be chosen with cross-validation. We compare the performance of these two semantic indexing mechanisms in Table 2 for our three data sets. As the baseline strategy we use visual matching, which is based on visual features only to retrieve related images. We also compare against the expected performance of a random ranking strategy, which was estimated using a Monte Carlo simulation. The chance performance is significantly lower than the visual baseline in all three data sets.

Observe that the semantic embeddings always outperform the visual baseline by a significant amount in terms of MAP. This shows that the proposed embeddings learn to predict a good semantic representation of images. For the Cervical Cancer data set, the best embedding is NSE, while NMFA performs better on the Basal-cell Carcinoma and Histology Atlas data sets. This is consistent with the complexity of the semantic vocabularies of these data sets, since the Cervical Cancer set has a term dictionary of eight keywords, while the Basal-cell Carcinoma and Histology Atlas have 18 and 46 terms, respectively. It is thus natural for a direct embedding to perform best with simple vocabularies, while the latent embedding is able to exploit complex correlations among many terms.

Notice also that none of the semantic embeddings were able to improve early precision with respect to the visual baseline, according to the results in Table 2. Early precision measures how many relevant images are shown in the first ten results, and in this case, visual matching seems to be a strong baseline. Our result is consistent with previous studies that show how the k-nearest neighbors algorithm in the visual space serves


Table 2
Retrieval performance of semantic strategies compared to visual matching. Results show the trade-off between MAP and P@10 on all image collections. Semantic search produces superior MAP while visual search is a strong baseline for early precision. Chance performance refers to the expected performance of a random ranking strategy, and it is significantly lower than visual and semantic search.

Method                        Cervical Cancer       Basal-cell C.         Histology Atlas
                              P@10      MAP         P@10      MAP         P@10      MAP
Visual matching (baseline)    0.5904    0.5214      0.4360    0.2928      0.7372    0.2751
Latent embedding (NMFA)       0.5067    0.6591      0.3176    0.4947      0.5263    0.6309
Direct embedding (NSE)        0.5414    0.6970      0.2543    0.4317      0.5230    0.6113
Chance performance            0.4623    0.4681      0.2183    0.1806      0.0978    0.1008

as a good baseline for image annotation [47]. This suggests that in the visual space, nearby images are more likely to have important correspondences of appearance patterns, which could result in some semantic relationship among them. This matching of visual patterns disappears when images are projected to the semantic space, where only dominant semantic concepts are preserved.

5.3.2. Multimodal fusion
To balance the trade-off between MAP and early precision, a multimodal fusion strategy may be used. The following experiments compare the ability of late fusion and of early fusion by back-projection to balance the performance of the semantic and visual representations. Both strategies allow us to control the relative importance of the data modalities, using one parameter for database images and another for query images, as described in Section 4.2.

Each parameter is the weight of a convex combination between the visual and semantic representations. The influence of these parameters was evaluated by producing fused representations with values varying between 0 and 1, with a step of 0.1, for α (database images) and β (query images). When evaluating the performance of each (α, β) pair, MAP and early precision (P@10) are measured. The parameter space was explored following this procedure on the three data sets using 10-fold cross-validation, to fuse visual features with the semantic representations obtained by the two proposed semantic embeddings (NMFA and NSE). We also followed the same parameter search procedure for the late fusion baseline, using both semantic embeddings as well. Fig. 5 presents the results of the multimodal fusion evaluation on the training set of the three histology image collections. These plots show the performance space with MAP on the x-axis and early precision

Fig. 5. Performance of multimodal fusion on the three histology image collections. Results were obtained during the parameter search by varying a and b to generate the multimodal representation. The plots show the Pareto frontier in the performance space for all multimodal fusion strategies, while purely visual and purely semantic performance are presented as single points. Reported methods: Nonnegative Semantic Embedding + Late Fusion (NSE-Late), NMF-Asymmetric + Late Fusion (NMFA-Late), NMF-Asymmetric + Back-projection (NMFA-BP), Nonnegative Semantic Embedding + Back-projection (NSE-BP), direct visual matching (Visual), latent semantic indexing using NMF-Asymmetric (NMFA), and direct semantic indexing using Nonnegative Semantic Embedding (NSE).

Each point in this space corresponds to one configuration of the retrieval system, either with multimodal fusion or with one of the baselines. Performance is best when both measures are maximized simultaneously, and we only plot the results for (a, b) pairs on the Pareto frontier of each multimodal fusion strategy. First, notice the difference between visual and semantic search (NMFA and NSE) in all plots, and how these two points lie on opposite sides of the performance space, making their trade-off visible. The plots also show that all multimodal fusion strategies can be configured to produce intermediate results between the performance of visual and semantic search by giving more weight to one or the other data modality. These Pareto frontiers trace the path along the trade-off between visual and semantic search, showing that we cannot obtain a fused result that is better than both of them individually. However, the proposed strategy allows us to choose a balance that preserves the best response of the visual or semantic data as needed. Finally, our proposed early fusion, based on back-projection of the semantic data, consistently outperforms the late fusion baseline, achieving improved performance in terms of both MAP and P@10. The benefits of back-projection over late fusion are clear in all three image collections by a large margin. Two characteristics make our strategy better than late fusion: first, during the back-projection step we take advantage of the learned relationships between visual and semantic data, and explicitly encode these correspondences in a reconstructed vector (see Eqs. (6) and (7)); second, we complement the visual representation with semantically reconstructed visual features, preserving the ability to match the original visual structures as well as semantic relations in the same space.
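Selecting which (a, b) configurations to plot reduces to keeping the non-dominated points in the (MAP, P@10) space. A minimal sketch of that step, applied to the records produced by the parameter search above (our own illustration, with hypothetical field names):

def pareto_frontier(results):
    # keep configurations that are not dominated in both MAP and P@10
    frontier = []
    for p in results:
        dominated = any(
            q['map'] >= p['map'] and q['p10'] >= p['p10'] and
            (q['map'] > p['map'] or q['p10'] > p['p10'])
            for q in results
        )
        if not dominated:
            frontier.append(p)
    # sort by MAP so the frontier can be plotted as a curve
    return sorted(frontier, key=lambda r: r['map'])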

5.4. Histology image search

This section presents the results of retrieval experiments conducted on the test set, after fixing the best parameters on the training set for each model. The results compare the performance of visual, semantic and multimodal search, along with the proposed baselines. We keep our attention on the two performance measures evaluated in the previous sections, to contrast the benefits of each approach. The evaluated methods can be grouped into the following broader strategies (the two fusion strategies are sketched in the example below):

1. Visual matching: only visual features are used to match query images and database images.
2. Semantic search: images are projected and matched in a semantic space. Two semantic indexing methods are evaluated: Nonnegative Semantic Embedding (NSE in Section 4.1) and NMF-Asymmetric (NMFA in Section 4.1.1).
3. Early fusion: visual and semantic features are combined in the same image representation. Here we evaluate the proposed back-projection strategy with the two semantic embeddings: NSE Back-Projection (NSE-BP) and NMFA Back-Projection (NMFA-BP).
4. Late fusion: visual and semantic features are matched independently and combined during the ranking procedure by mixing their similarity measures. We evaluate a late fusion strategy between the visual features and the semantic features predicted by NSE and NMFA.

These four groups facilitate the interpretation of the results and allow meaningful comparisons between the different strategies.

The results with respect to MAP are presented in Fig. 6 for each data set. This figure ranks the methods in decreasing order to compare relative gains in MAP. Notice that the semantic methods occupy the top positions of the rankings, indicating that semantic indexing is good at optimizing the precision of an image retrieval system. Precision measured by MAP can also be seen as a proxy for image auto-annotation performance, that is, how faithful the predicted semantic terms are for all images. The results therefore suggest that the semantic embeddings are able to predict meaningful annotations for images, which are then matched correctly. The second group (according to MAP) in the ranking of methods shown in Fig. 6 is the group of methods based on back-projection. The difference with respect to the semantic methods is mainly due to the trade-off discussed in the previous section: we selected the fusion parameters mainly to improve early precision (P@10), since our goal is to balance the trade-off for image retrieval and the first results are crucial for a good user experience.
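As a rough illustration of how groups 3 and 4 differ, the sketch below contrasts late fusion of similarity scores with early fusion by back-projection. It is our own simplification: the exact reconstruction used in the paper is given by Eqs. (6) and (7), whereas here the back-projection is written generically as a linear reconstruction through a hypothetical learned semantic-to-visual basis W, and cosine similarity stands in for whatever matching function is used. For query images, the semantic representation is the one predicted by the embedding, since queries carry no annotations.

import numpy as np

def cosine_sim(Q, D):
    # pairwise cosine similarity between query rows (Q) and database rows (D)
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + 1e-12)
    Dn = D / (np.linalg.norm(D, axis=1, keepdims=True) + 1e-12)
    return Qn @ Dn.T

def late_fusion_scores(Vq, Vdb, Sq, Sdb, w=0.5):
    # match each modality independently, then mix the similarity scores
    return w * cosine_sim(Vq, Vdb) + (1.0 - w) * cosine_sim(Sq, Sdb)

def back_project(V, S, W, weight):
    # early fusion sketch: reconstruct visual features from the (predicted)
    # semantic representation S through the basis W, then combine them with
    # the original visual features in the visual space
    V_reconstructed = S @ W
    return weight * V + (1.0 - weight) * V_reconstructed

def early_fusion_scores(Vq, Vdb, Sq, Sdb, W, a=0.5, b=0.5):
    # a single similarity computed on the fused (visual-space) representation
    db = back_project(Vdb, Sdb, W, a)   # a weighs database images
    q = back_project(Vq, Sq, W, b)      # b weighs query images
    return cosine_sim(q, db)

In the late fusion baseline only the scores are mixed, so the learned visual-semantic correspondences never enter the representation itself, which is consistent with the discussion in Section 5.3.2.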

Fig. 6. Mean Average Precision (MAP) of all evaluated strategies on the three histology image collections. Bars are absolute MAP values, and percentages indicate the relative improvement with respect to the method with the lowest performance, which is the purely visual search in all cases. Semantic methods have the best overall performance, followed by the proposed early fusion methods. Note that the scale of performance has been set differently for each data set to highlight relative improvements. The overall tendency across data sets is best viewed in color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 7. Early precision (P@10) of all evaluated strategies on the three histology image collections. Bars are absolute P@10 values, and percentages indicate the relative improvement with respect to the method with the lowest performance, which is the pure semantic search in all cases. The proposed fusion approach presents the best performance for two data sets, improving over the visual baseline. In all cases, our proposed method outperforms late fusion. Note that the scale of performance has been set differently for each data set to highlight relative improvements. The overall tendency across data sets is best viewed in color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Finally, notice that our proposed fusion approach consistently outperforms the late fusion of similarities in all three data sets.

As for early precision, Fig. 7 shows the relative differences among all evaluated methods. The ranking of methods changes, leaving the semantic-based approaches at the bottom, even below the visual baseline. In two of the three data sets the proposed back-projection scheme obtains the best performance, since we balanced the parameters during cross-validation to improve this measure. We wanted to bring more semantic information to the top of the ranked list of results, and our strategy proves effective at combining both data modalities in a single representation. The visual baseline is very strong in the case of the Histology Atlas data set, leaving all other methods behind by a large margin. Nevertheless, fusion by back-projection offers a significant improvement over the original semantic representation, showing a good intermediate compromise. Our fusion methodology also shows important improvements over the late fusion baseline.

Some queries are illustrated in Fig. 8 along with the top nine results retrieved by three methods: visual matching, a semantic embedding, and the multimodal representation. Queries are single example images with no text descriptions. The visual ranking brings images that match low-level features without any knowledge of their high-level interpretation, and thus sometimes fails to retrieve the correct results. The semantic embedding selected for each database is the one with the best performance on the test set according to MAP (NSE for Cervical Cancer and NMFA for Basal-cell Carcinoma and Histology Atlas). The results obtained by matching the representation in the semantic space are diverse and correspond to images with higher scores in the terms predicted for the query.

This strategy does not consider visual information for ranking images (the semantic embeddings do use visual information, but only to project images into the semantic space), which results in large variations of appearance. The ranking produced by the fused representation improves the retrieval performance and also produces more visually consistent results, since the fusion takes place in the visual space. This shows how the proposed approach can effectively introduce semantic information into the visual representation to retrieve correct images that respect visual structures.

Finally, to give a more general sense of the benefits of each approach, we compare the methods with respect to their positions in the MAP and P@10 rankings shown in Figs. 6 and 7. We use the average position of each method across the three histology image collections, and re-rank the methods to provide a unified comparison. Fig. 9 presents a visualization of these rankings with respect to MAP (on the x-axis) and P@10 (on the y-axis). An ideal method would sit at coordinate (1, 1), meaning it ranked first on both performance measures. This visualization also reveals the trade-off between visual and semantic representations, indicating that, on average, semantic methods rank first with respect to MAP. Fusion methods occupy the intermediate positions, and fusion by back-projection ranks on average above late fusion with respect to both performance measures, providing an improved balance. Notice also how NMFA-BP usually ranks ahead of the visual baseline in terms of P@10, as well as in terms of MAP. In fact, the results suggest that NMFA-BP, which consists of a latent semantic embedding with the corresponding back-projection and fusion, produces the multimodal representation with the best compromise: it stands in the first position of the rankings for early precision while remaining close to the performance of the semantic embeddings with respect to MAP.
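The aggregation behind Fig. 9 is simply an average of rank positions per measure across collections. A minimal sketch of that step (our own illustration; the method names are generic placeholders, not the paper's reported orderings):

def average_rank(per_collection_rankings):
    # each ranking is a list of method names ordered best-to-worst
    # for one collection and one measure (e.g., MAP)
    positions = {}
    for ranking in per_collection_rankings:
        for pos, method in enumerate(ranking, start=1):
            positions.setdefault(method, []).append(pos)
    return {m: sum(p) / len(p) for m, p in positions.items()}

# hypothetical usage with generic method names:
avg = average_rank([
    ['method_A', 'method_B', 'method_C'],
    ['method_B', 'method_A', 'method_C'],
    ['method_A', 'method_C', 'method_B'],
])
# avg == {'method_A': 1.33..., 'method_B': 2.0, 'method_C': 2.66...}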


Fig. 8. Example queries at left with the top nine images retrieved by visual, semantic and fused methods. Green frames indicate relevant images, and red frames indicate incorrectly retrieved images. Notice that semantic methods (NSE, NMFA) produce results with large visual variations since no visual information is considered for ranking. The proposed fusion approach (NSE-BP, NMFA-BP) improves the precision of the results and also brings a set of more visually consistent images. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 9. Rankings of evaluated methods according to their average position of performance among the three data sets. The x-axis represents the average ranking with respect to MAP, and the y-axis represents the average ranking with respect to P@10. Points close to the origin have better performance.


5.5. Discussions

5.5.1. Multimodal fusion

The proposed framework provides a learning-based tool for the fusion of visual content and semantic information in histology images. The method models cross-modal relationships through semantic embeddings, which have the interesting property of making the two modalities exchangeable from one space to the other. This property may be understood as a translation scheme between two languages that express the same concepts in different ways: one language is visual and communicates the optical details found in images, while the other is semantic and represents high-level interpretations of images. These two views of the same data are complementary and are fused to build a better image representation.

This paper presents an approach to histology image retrieval following a multimodal setup, the first of its kind reported in the literature. Previous work on semantic retrieval of histology images is mainly oriented toward training classifiers to recognize biological structures in images [7,8,22]. That strategy can be understood as a translation from the visual space to the semantic space without the possibility of translating in the opposite direction, and is thus limited to a late fusion procedure. The experimental results in this work have shown that an exclusively semantic representation may lose information that is important for image search with example images. More importantly, our results show a consistent benefit of an early fusion strategy based on multimodal relationships over the popular late fusion approach, and the benefits of early fusion go beyond improved retrieval performance: a better image representation may serve as the starting point for other systems that take this multimodal representation as input to learn classifiers or to solve more complex tasks.

5.5.2. Query expansion effect

The main reason for studying the fusion of visual and semantic data is that they are complementary sources of information: visual data tends to be ambiguous while semantic data tends to be very specific, and visual data provides detailed appearance descriptions while semantic data gives no clues about how an image looks. Depending on the fusion strategy, multimodal relationships can therefore become more useful for making decisions about the data.

Our image retrieval setup considers example images as queries. Since the visual content representation used in this work is based on a bag of features, an analogy with text vocabularies may help to explain the effects of multimodal fusion. Visual features in the dictionary of codeblocks may be understood as visual words representing specific visual arrangements or configurations. A specific pattern is a low-level word that may have different meanings from a high-level, semantic perspective. This problem is known in natural language processing as polysemy and usually decreases retrieval precision, that is, the ability of the system to retrieve only relevant documents [48]. Conversely, different visual words may be related to the same high-level meaning, which is known as synonymy and can reduce the ability of an information retrieval system to retrieve all relevant documents [48]. The experimental results in Section 5.4 are consistent with these definitions: a visual polysemy effect is observed when the retrieval system is based on visual features only (the lowest MAP score), while a visual synonymy effect is observed when using semantic data, with a higher MAP score but lower early precision (P@10).

Thus, the back-projection of semantic data introduces additional visual words that are semantically correlated with the query, correcting the synonymy effect; this is the context in which the visual query expansion effect takes place. Moreover, when both modalities are combined with appropriate weights, the polysemy effect can also be corrected.
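A toy numerical example may make the query expansion effect concrete. It is purely illustrative: the basis W, the two terms and the four visual words are invented for the example and are not taken from the paper. Back-projecting the predicted semantics of a query activates visual words that never appeared in the query itself but co-occur with the same term.

import numpy as np

# hypothetical semantic-to-visual basis learned from training data
# (rows: terms, columns: visual words); visual words 0 and 1 are
# "visual synonyms" because both co-occur with the same term
W = np.array([[0.5, 0.5, 0.0, 0.0],   # term 1 -> visual words 0 and 1
              [0.0, 0.0, 0.6, 0.4]])  # term 2 -> visual words 2 and 3

query_visual = np.array([1.0, 0.0, 0.0, 0.0])  # the query uses only visual word 0
query_semantic = np.array([1.0, 0.0])          # predicted semantics: term 1 only

# convex combination of the original and back-projected visual vectors
fused = 0.5 * query_visual + 0.5 * (query_semantic @ W)
print(fused)  # [0.75 0.25 0.   0.  ] -> visual word 1 is now active as well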

5.5.3. Large semantic vocabularies

Previous work on histology image retrieval is mainly based on classifiers trained to recognize particular biological structures [7,18,20,8,22,9]. Transferring these methodologies to real-world system implementations requires a significant tuning effort, since each classifier may have its own optimal configuration. The proposed method is a unified approach that integrates all semantic labels in a single matrix for learning multimodal relationships. This makes an implementation simpler and ready to scale up to new keywords, as long as corresponding example images are available. Our methods adapt to vocabularies of different sizes, as shown in the experimental evaluations, which included three histology databases with different numbers of images and associated keywords; the effort of introducing new semantic terms into the model is virtually zero. In fact, our experiments include the analysis of histology images with the largest vocabulary reported so far. We believe that image retrieval systems have the potential to support clinical activities, and to achieve this, the underlying computational methods have to be flexible and prepared to use semantic data as it is available in current management systems. This involves vocabularies with hundreds of medical terms and thousands of images, which can easily be handled by the methods proposed in this paper.

5.5.4. Histology image collections

Digital pathology currently makes it possible to manage, share and preserve slides together with electronic health records, which are important steps toward modernizing infrastructure and providing improved services. However, these systems can go beyond passive data repositories to actually help with the organization, search, visualization and discovery of information hidden in histology image collections. This potential could benefit diagnostic activities as well as scientific research and academic training, and to realize it, new tools and methodologies have to be designed and evaluated. Visual search technologies are among the most pervasive applications in daily life, and they could be seamlessly integrated into the practice of pathology as long as the methodologies meet the requirements of such an endeavor. This work proposed enhanced histology image representations, built from visual and semantic features, for constructing effective retrieval systems. The resulting representation can also be used in other automated analysis systems, which could be essential in medical imaging departments to support various decisions in clinical practice.

5.5.5. Other considerations

This paper has presented a study with experimental evidence in favor of an early fusion strategy. Even though the proposed early fusion algorithm has shown improved performance, the final accuracy is still far from perfect, and there are several opportunities for improvement on both the technical and experimental sides. On the technical side, our early fusion algorithm may be understood as a procedure to learn an image representation given visual features and text annotations. The visual features were learned in an unsupervised way following a bag-of-features approach, which has limited capacity to encode very complex visual patterns. Learning more powerful visual features may help to improve performance, as suggested by several recent works [49,50].

Also, even though our early fusion algorithm is simple and efficient, it still requires more computation per image than late fusion algorithms. On the experimental side, one of the limitations of our study has been limited access to annotated data. We conducted experiments on three data sets of small to medium size; however, an indexing method like the one proposed in this paper could benefit from more data, which is difficult to collect from real medical cases and has restricted use in research and even in practical settings. We have shared part of the data collections used in this work in the hope that other researchers may benefit from open, high-quality histology images, and we keep looking for opportunities to access more sources of information, both in the community and within our own institutions.

6. Conclusions

This work presented a framework for building histology image representations that combine visual and semantic features, following a novel early-fusion approach. The proposed method learns the relationships between the two data modalities and uses that model to project semantic information back to the visual space, in which the fused representation is built. The resulting multimodal representation is used in an image search system that matches potential results using a similarity measure; however, its use can be extended to other histology image analysis tasks such as classification or clustering. The experimental evaluation conducted in this work included three histology image collections of various sizes and with different numbers of text terms, demonstrating the potential of the proposed multimodal indexing methods under different conditions.

We observed a trade-off between optimizing MAP and early precision when using either a semantic or a visual representation alone, which is mainly explained by the complementary nature of the two data modalities. The proposed multimodal fusion approach is an effective strategy to balance this trade-off and to improve the quality of image representations. Our methods consistently outperformed the visual matching and late fusion baselines in the image retrieval task, providing the best balance between visual and semantic search. We observed that, overall, semantic search strategies are very good at maximizing MAP, and our proposed early fusion strategies can incorporate more visual information into the search process at the cost of small reductions in MAP. Fusion methods still require further investigation into how to better use visual features to satisfy the visual consistency or visual diversity criteria demanded by potential users, without decreasing the semantic meaningfulness of the retrieved results.

Semantic-based indexing can also be exploited through keyword-based search, instead of the query-by-visual-example paradigm that was the main search mode evaluated in this work. Keyword-based search may also be executed on a multimodal index, since by definition it contains both information modalities, visual and semantic. Further potential research directions include the application of this representation to other image analysis tasks, such as image classification and automated grading. Also, since the formulation of our method can handle arbitrarily large semantic vocabularies, we are interested in extending its applicability to large-scale biomedical image collections. We argue in favor of multimodal indexing not only because of its potential to significantly improve relative performance, as shown in this paper, but also because this strategy can model different user interaction mechanisms, which could be adapted according to real needs.

Nevertheless, an additional intriguing question beyond indexing mechanisms is: what is the minimum performance required of image search technologies in a real clinical setting? The potential impact of an image retrieval system on health care is promising [13], but wide adoption will require more coordinated and collaborative efforts. Machine learning tools can empower clinicians with timely and relevant information to make evidence-based decisions, which may result in improved quality of care for patients; this is currently a driving force for a large body of research.

Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive comments, which helped to improve and clarify this manuscript. This work was partially funded by the LACCIR-Microsoft project "Multimodal Image Retrieval to Support Medical Case-Based Scientific Literature Search".

References

[1] Kragel P, Kragel P. Digital microscopy: a survey to examine patterns of use and technology standards. In: Proceedings of the IASTED international conference on telehealth/assistive technologies. Anaheim (CA, USA): ACTA Press; 2008. p. 195–7.
[2] Müller H, Michoux N, Bandon D, Geissbuhler A. A review of content-based image retrieval systems in medical applications – clinical benefits and future directions. Int J Med Inf 2004;73(1):1–23.
[3] Datta R, Joshi D, Li J, Wang JZ. Image retrieval: ideas, influences, and trends of the new age. ACM Comput Surv 2008;40(2):1–60.
[4] Smeulders AW, Worring M, Santini S, Gupta A, Jain R. Content-based image retrieval at the end of the early years. IEEE Trans Pattern Anal Mach Intell 2000;22(12):1349–80.
[5] Barnard K, Duygulu P, Forsyth D, de Freitas N, Blei DM, Jordan MI. Matching words and pictures. J Mach Learn Res 2003;3:1107–35.
[6] Rasiwasia N, Moreno PJ, Vasconcelos N. Bridging the gap: query by semantic example. IEEE Trans Multimedia 2007;9(5):923–38.
[7] Tang HL, Hanka R, Ip HHS. Histological image retrieval based on semantic content analysis. IEEE Trans Inf Technol Biomed 2003;7(1):26–36.
[8] Naik J, Doyle S, Basavanhally A, Ganesan S, Feldman MD, Tomaszewski JE, et al. A boosted distance metric: application to content based image retrieval and classification of digitized histopathology. SPIE Med Imag: Comput-Aided Diagn 2009;7260:72603F1–12.
[9] Caicedo JC, Romero E, González FA. Content-based histopathology image retrieval using a kernel-based semantic annotation framework. J Biomed Inf 2011;44:519–28.
[10] Atrey PK, Hossain MA, El Saddik A, Kankanhalli MS. Multimodal fusion for multimedia analysis: a survey. Multimedia Syst 2010;16(6):345–79.
[11] La Cascia M, Sethi S, Sclaroff S. Combining textual and visual cues for content-based image retrieval on the world wide web. In: Proceedings of the IEEE workshop on content-based access of image and video libraries; 1998. p. 24–8.
[12] Nuray R, Can F. Automatic ranking of information retrieval systems using data fusion. Inf Process Manage 2006;42(3):595–614.
[13] Marchiori A. Automated storage and retrieval of thin-section CT images to assist diagnosis: system description and preliminary assessment. Radiology 2003;228:265–70.
[14] Bonnet N. Some trends in microscope image processing. Micron 2004;35(8):635–53.
[15] Doyle S, Hwang M, Shah K, Madabhushi A, Feldman M, Tomaszeweski J. Automated grading of prostate cancer using architectural and textural image features. In: 4th IEEE international symposium on biomedical imaging: from nano to macro; 2007. p. 1284–7.
[16] Zheng L, Wetzel AW, Gilbertson J, Becich MJ. Design and analysis of a content-based pathology image retrieval system. IEEE Trans Inf Technol Biomed 2003;7(4):249–55.
[17] Caicedo JC, Gonzalez FA, Romero E. A semantic content-based retrieval method for histopathology images. Inf Retriev Technol LNCS 2008;4993:51–60.
[18] Orlov N, Shamir L, Macura T, Johnston J, Eckley DM, Goldberg IG. WND-CHARM: multi-purpose image classification using compound image transforms. Pattern Recogn Lett 2008;29(11):1684–93.
[19] Tambasco M, Costello BM, Kouznetsov A, Yau A, Magliocco AM. Quantifying the architectural complexity of microscopic images of histology specimens. Micron 2009;40(4):486–94.
[20] Caicedo JC, Cruz A, Gonzalez FA. Histopathology image classification using bag of features and kernel functions. In: Artif Intell Med. Springer; 2009. p. 126–35.
[21] Mosaliganti K, Janoos F, Irfanoglu O, Ridgway R, Machiraju R, Huang K, et al. Tensor classification of N-point correlation function features for histology tissue segmentation. Med Image Anal 2009;13(1):156–66.

[22] Meng T, Lin L, Shyu M-L, Chen S-C. Histology image classification using supervised classification and multimodal fusion. In: 2010 IEEE international symposium on multimedia. IEEE; 2010. p. 145–52.
[23] Müller H, Kalpathy-Cramer J. The ImageCLEF medical retrieval task at ICPR 2010. In: Proceedings of the 20th international conference on pattern recognition; 2010. p. 3284–7.
[24] Kalpathy-Cramer J, Hersh W. Multimodal medical image retrieval: image categorization to improve search precision. In: Proceedings of the international conference on multimedia information retrieval. ACM; 2010. p. 165–74.
[25] Rahman M, Antani S, Thoma G. A learning-based similarity fusion and filtering approach for biomedical image retrieval using SVM classification and relevance feedback. IEEE Trans Inf Technol Biomed 2011;15(4):640–6.
[26] Müller H, Deselaers T, Deserno T, Clough P, Kim E, Hersh W. Overview of the ImageCLEFmed 2006 medical retrieval and medical annotation tasks. In: Evaluation of multilingual and multi-modal information retrieval. Springer; 2007. p. 595–608.
[27] Müller H, Eggel I, Bedrick S, Radhouani S, Bakke B, Kahn Jr C, et al. Overview of the CLEF 2009 medical image retrieval track. In: Cross Language Evaluation Forum (CLEF) working notes.
[28] de Herrera AGS, Kalpathy-Cramer J, Demner-Fushman D, Antani S, Müller H. Overview of the ImageCLEF 2013 medical tasks. Working Notes of CLEF; 2013.
[29] Müller H, de Herrera AGS, Kalpathy-Cramer J, Demner-Fushman D, Antani S, Eggel I. Overview of the ImageCLEF 2012 medical image retrieval and classification tasks. In: CLEF (online working notes/labs/workshop); 2012.
[30] Caicedo JC, BenAbdallah J, González FA, Nasraoui O. Multimodal representation, indexing, automated annotation and retrieval of image collections via non-negative matrix factorization. Neurocomputing 2012;76(1):50–60.
[31] Fan J, Gao Y, Luo H, Keim DA, Li Z. A novel approach to enable semantic and visual image summarization for exploratory image search. In: Proceedings of the 1st ACM international conference on multimedia information retrieval. ACM; 2008. p. 358–65.
[32] Romberg S, Lienhart R, Hörster E. Multimodal image retrieval. Int J Multimedia Inf Retriev 2012;1(1):31–44.
[33] Putthividhy D, Attias HT, Nagarajan SS. Topic regression multi-modal latent Dirichlet allocation for image annotation. In: 2010 IEEE conference on computer vision and pattern recognition. IEEE; 2010. p. 3408–15.
[34] Rusu M, Wang H, Golden T, Gow A, Madabhushi A. Multiscale multimodal fusion of histological and MRI lung volumes for characterization of lung inflammation. In: SPIE medical imaging. International Society for Optics and Photonics; 2013. p. 86720X.
[35] Meng T, Lin L, Shyu M-L, Chen S-C. Histology image classification using supervised classification and multimodal fusion. In: 2010 IEEE international symposium on multimedia. IEEE; 2010. p. 145–52.

[36] Vanegas JA, Caicedo JC, González FA, Romero E. Histology image indexing using a non-negative semantic embedding. In: Proceedings of the second MICCAI international conference on medical content-based retrieval for clinical decision support, vol. 7075. LNCS; 2012. p. 80–91 [chapter 8].
[37] Caicedo JC, Gonzalez FA, Triana E, Romero E. Design of a medical image database with content-based retrieval capabilities. Adv Image Video Technol LNCS 2007;4872:919–31.
[38] Cruz-Roa A, Caicedo JC, González FA. Visual pattern mining in histology image collections using bag of features. Artif Intell Med 2011;52(2):91–106.
[39] Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. Cambridge University Press; 2008.
[40] Hare JS, Samangooei S, Lewis PH, Nixon MS. Semantic spaces revisited: investigating the performance of auto-annotation and semantic retrieval using semantic spaces. In: Proceedings of the 2008 international conference on content-based image and video retrieval. New York (NY, USA): ACM; 2008. p. 359–68.
[41] Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401(6755):788–91.
[42] Barla A, Odone F, Verri A. Histogram intersection kernel for image classification. In: Proceedings of the international conference on image processing, vol. 3; 2003. p. 513–16.
[43] Grauman K, Darrell T. The pyramid match kernel: discriminative classification with sets of image features. In: Tenth IEEE international conference on computer vision, vol. 2; 2005.
[44] Hsu DF, Taksa I. Comparing rank and score combination methods for data fusion in information retrieval. Inf Retriev 2005;8(3):449–80.
[45] Mc Donald K, Smeaton AF. A comparison of score, rank and probability-based fusion methods for video shot retrieval. In: Image and video retrieval. Springer; 2005. p. 61–70.
[46] Lee JH. Analyses of multiple evidence combination. In: Special interest group on information retrieval. ACM SIGIR conference, vol. 31. ACM; 1997. p. 267–76.
[47] Makadia A, Pavlovic V, Kumar S. A new baseline for image annotation. In: Proceedings of the 10th European conference on computer vision. Berlin, Heidelberg: Springer-Verlag; 2008. p. 316–29.
[48] Carpineto C, Romano G. A survey of automatic query expansion in information retrieval. ACM Comput Surv 2012;44(1):1–50.
[49] Cruz-Roa A, Arevalo JE, Madabhushi A, González FA. A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection. In: Medical image computing and computer-assisted intervention – MICCAI 2013. Springer; 2013. p. 403–10.
[50] Wang H, Cruz-Roa A, Basavanhally A, Gilmore H, Shih N, Feldman M, et al. Cascaded ensemble of convolutional neural networks and handcrafted features for mitosis detection; 2014.
