Literature-based biomedical image classification and retrieval

Matthew S. Simpson, Daekeun You, Md Mahmudur Rahman, Zhiyun Xue, Dina Demner-Fushman∗, Sameer Antani, George Thoma

Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD, USA

∗ Corresponding author. Tel.: +1 3014355320. E-mail address: [email protected] (D. Demner-Fushman).

Article history: Received 14 December 2013; Received in revised form 6 June 2014; Accepted 10 June 2014

Keywords: Image-based retrieval; Case-based retrieval; Modality classification; Compound figure separation

Abstract

Literature-based image informatics techniques are essential for managing the rapidly increasing volume of information in the biomedical domain. Compound figure separation, modality classification, and image retrieval are three related tasks useful for enabling efficient access to the most relevant images contained in the literature. In this article, we describe approaches to these tasks and the evaluation of our methods as part of the 2013 medical track of ImageCLEF. In performing each of these tasks, the textual and visual features used to represent images are an important consideration often left unaddressed. Therefore, we also describe a gradient-based optimization strategy for determining meaningful combinations of features and apply the method to the image retrieval task. An evaluation of our optimization strategy indicates the method is capable of producing statistically significant improvements in retrieval performance. Furthermore, the results of the 2013 ImageCLEF evaluation demonstrate the effectiveness of our techniques. In particular, our text-based and mixed image retrieval methods ranked first among all the participating groups. Published by Elsevier Ltd.

1. Introduction

Images are sources of essential information within the biomedical literature. Biomedical images are useful for a variety of purposes, including research and education, and their content often conveys information that is not otherwise mentioned in the surrounding text of an article. Given the rapid pace of scientific discovery, it is increasingly difficult to locate informative images within large volumes of literature. In order to enable efficient access to the most relevant images for a given request, various text-based and content-based image classification and retrieval approaches, as well as multimodal methods that combine textual and visual strategies, have been proposed and evaluated within the biomedical domain.

Existing classification and retrieval approaches differ in how they represent images and determine their relevance. Text-based methods represent images with descriptive text, such as figure captions, and retrieve or classify images using traditional text-based techniques. Content-based methods represent images using visual descriptors and use these descriptors to retrieve or classify images based on their visual appearance. Visual descriptors describe various aspects of an image's appearance, such as its
color or texture. Finally, multimodal techniques represent images with both textual and visual features. For performing retrieval tasks, multimodal approaches allow for the construction of complex information requests consisting of a textual description of the desired image content that is augmented with the visual descriptors extracted from one or more example images. Datta et al. [1] provide a general overview of existing image classification and retrieval approaches.

In this article, we describe our literature-based biomedical image classification and retrieval methods and their evaluation as part of the 2013 medical track of ImageCLEF [2]. The medical track of ImageCLEF consists of an image modality classification task, a compound figure separation task, and retrieval tasks. For the classification task, the goal is to classify a given set of images according to 31 modalities (e.g., "Computerized Tomography," "Electron Microscopy," etc.). The modalities are organized hierarchically into meta-classes such as "Radiology" and "Microscopy," which are themselves types of "Diagnostic Images." For the compound figure separation task, the goal is to segment the panels of multi-panel figures. Figures contained in biomedical articles are often composed of multiple panels (commonly labeled "a," "b," etc.), and segmenting them can result in improved retrieval performance. Finally, in the image retrieval task, a set of ad-hoc information requests is given, and the goal is to retrieve the most relevant images from a collection of biomedical articles for each topic. Although the 2013 medical track of ImageCLEF includes an additional case-based retrieval task, we will focus in this article only on the image retrieval task.
An important consideration for each of the above tasks that is often left unaddressed is the relative effectiveness of the various textual and visual features used to represent images. For example, for some biomedical image retrieval tasks, the content of an image's caption might be the most significant indicator of its relevance, whereas for other tasks, an image's color might be the most important feature. Information classification and retrieval systems commonly provide a mechanism to "weight" these features as a means of specifying their perceived significance. However, lacking knowledge of the images that are relevant for a specific set of requests, it is common practice to adjust the feature weights manually based on domain experience rather than determining an ideal allocation automatically.

Insight can be gained into literature-based biomedical image classification and retrieval tasks by optimizing the weights a multimodal system allocates to various textual and visual features. Given the well-known challenges associated with determining meaningful multimodal feature combinations [3], feature weight optimization has the potential to result in a significant improvement in image classification and retrieval performance. In addition, an analysis of the most effective features may lead to more sophisticated textual and visual feature integration strategies.

In addition to describing our basic approaches for each of the tasks of the 2013 medical track of ImageCLEF, we also describe in this article a feature optimization strategy for the image retrieval task. Much existing work addresses the problem of optimizing the internal parameters of traditional text-based information retrieval systems. Researchers have demonstrated success using a variety of optimization techniques, some of which include direct and interactive optimization, the use of genetic algorithms [4], and gradient-based methods. Among the many proposed strategies, gradient-based techniques, such as those proposed by Taylor et al. [5] and Chapelle and Wu [6], have received considerable attention for their efficiency and ease of implementation. However, to our knowledge, similar techniques have yet to be explored within the context of multimodal biomedical image retrieval.

In the following sections, we describe our methods and results. In Section 2, we describe the textual and visual features we use to represent the images contained in biomedical articles and how we organize them in a structure useful for image classification and retrieval tasks. Our approaches are then described for the compound figure separation, modality classification, and image retrieval tasks. After presenting our basic image retrieval methods, we describe in greater detail our image retrieval system and the gradient-based optimization strategy we used to optimize its internal parameters. Also described in Section 2 are each of the runs we submitted as part of the ImageCLEF evaluation. Finally, in Sections 3 and 4 we present and discuss the results of our image classification and retrieval methods.

2. Methods

In this section, we describe our compound figure separation, modality classification, and image retrieval methods. However, before doing so, we first describe the textual and visual features with which we represent images. In presenting our image retrieval approach, we also describe how we optimize the internal parameters of our biomedical image retrieval system in order to determine effective combinations of these features. Lastly, we describe the runs we submitted for each task of the ImageCLEF evaluation.

2.1. Image representation

The images contained in biomedical articles are represented using a combination of the textual and visual features described below.

2.1.1. Textual features

Our textual features include both document-level and figure-level features. Document-level features include the title, abstract, and MeSH® terms (Medical Subject Headings) of the article in which an image appears. MeSH is a controlled vocabulary created by the U.S. National Library of Medicine (NLM) to index biomedical articles. Because we expect some of the MeSH terms assigned to an article to be more relevant than others to the images contained therein, we divide an article's MeSH terms into two groups: one group contains terms designated "major" by the NLM indexers, and the other contains terms designated "minor." MeSH terms are extracted from the metadata associated with each article. Our figure-level features include an image's caption and its "mentions," which are snippets of text within the body of the article that discuss the image. Mentions are identified within the body of an article by hyperlinks to the images to which the snippets of text refer.

2.1.2. Visual features

In addition to the above textual features, we also represent the visual content of images using various low-level visual descriptors. Table 1 summarizes the descriptors we extract and their dimensionality. Due to the large number of these features, we forgo describing them in detail; however, they are all well known and discussed extensively in the existing literature.

Table 1
Extracted visual descriptors.

No.  Descriptor                                                Dimensionality
1.   Autocorrelation                                           25
2.   Edge frequency                                            25
3.   Fuzzy color and texture histogram^a (FCTH) [7]            192
4.   Gabor moment^a                                            60
5.   Gray-level co-occurrence matrix moment (GLCM) [8]         20
6.   Local binary pattern (LBP1) [9]                           256
7.   Local binary pattern (LBP2) [9]                           256
8.   Scale-invariant feature transformation^a (SIFT) [10]      256
9.   Shape moment                                              5
10.  Tamura moment^a [11]                                      18
11.  Edge histogram^a (EHD) [12]                               80
12.  Color and edge directivity^a (CEDD) [13]                  144
13.  Primitive length                                          5
14.  Color layout^a (CLD) [12]                                 16
15.  Color moment                                              3
16.  Semantic concept (SCONCEPT) [14]                          30
     Combined                                                  1391

^a Feature computed using the Lucene Image Retrieval library [15].

2.1.2.1. Cluster words. Content-based image retrieval systems commonly define visual similarity as a distance between extracted visual descriptors. Unfortunately, computing this distance exactly can be an expensive operation that increases the response time of a retrieval system. In order to avoid the computational complexity of computing these distances, we create a textual representation of the descriptors that we integrate with our existing textual features following the global feature mapping approach of Simpson et al. [16]. For each descriptor, the method clusters the feature vectors extracted from all the images in the collection using a hierarchical version of the k-means++ algorithm [17]. The method then assigns each cluster of features a unique alphanumeric code word and represents each image as a bag of these "visual words." Although the details of this process are omitted here for brevity, the method produces, for each image, a textual signature of our visual features that can be indexed and searched using traditional text-based information retrieval systems.
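The sketch below illustrates the general shape of this mapping. It is a simplified stand-in rather than the implementation used in our experiments: it substitutes scikit-learn's flat k-means++ for the hierarchical variant of [17], and the descriptor name, cluster count, and code-word format are assumptions made only for the example.

# A minimal sketch of cluster-word ("visual word") generation, assuming flat
# k-means++ in place of the hierarchical variant described in the text.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_matrix, n_clusters=64, seed=0):
    """Cluster one descriptor type (e.g., 16-d CLD vectors) over the collection."""
    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10, random_state=seed)
    return km.fit(descriptor_matrix)

def cluster_words(km, image_vectors, prefix="CLD"):
    """Map each image's descriptor vector to an alphanumeric code word."""
    labels = km.predict(image_vectors)
    return ["%s_%03d" % (prefix, label) for label in labels]

if __name__ == "__main__":
    collection = np.random.rand(1000, 16)              # stand-in for a collection of CLD descriptors
    vocabulary = build_vocabulary(collection)
    print(cluster_words(vocabulary, collection[:3]))   # e.g., ['CLD_041', 'CLD_007', 'CLD_059']

Indexing such code words alongside the textual fields is what allows a standard text retrieval engine to approximate visual similarity, as described above.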

Fig. 1. The portion of the enriched citation relating to Fig. 2a from "Rare clinical experiences for surgical treatment of melanoma with osseous metastases in Taiwan" by Huang et al. [32].

2.1.2.2. Attribute selection. An orthogonal approach to transforming our visual descriptors into a computationally manageable representation is attribute selection. By eliminating unneeded or redundant information, these techniques can also improve our modality classification and image retrieval methods. Attribute selection is performed using the WEKA [18] data mining software. First, we group all our visual descriptors into the single combined vector given in Table 1, and we then perform attribute selection to reduce the dimensionality of this combined feature to fewer than two hundred attributes. Attribute selection is investigated as an alternative to clustering; thus, we do not compute cluster words for the lower-dimensional features.
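As a rough analogue of this step, the sketch below performs a supervised attribute selection over the combined 1391-dimensional visual vector. WEKA is used in our actual system, so the scikit-learn filter and the target of 190 attributes shown here are illustrative assumptions only.

# An illustrative attribute-selection sketch; the paper's system uses WEKA.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

def select_attributes(combined_vectors, modality_labels, k=190):
    """Reduce the combined visual descriptor of Table 1 to k attributes."""
    selector = SelectKBest(score_func=f_classif, k=k)
    reduced = selector.fit_transform(combined_vectors, modality_labels)
    return reduced, selector

if __name__ == "__main__":
    X = np.random.rand(500, 1391)               # combined descriptor from Table 1
    y = np.random.randint(0, 31, size=500)      # 31 modality classes
    X_small, sel = select_attributes(X, y)
    print(X_small.shape)                        # (500, 190)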

2.2. Enriched citations

For performing retrieval tasks, we organize the textual features and cluster words of each image in an article as a structured document called an enriched citation. Fig. 1 shows a portion of the enriched citation produced for the article entitled "Rare clinical experiences for surgical treatment of melanoma with osseous metastases in Taiwan" by Huang et al. [32]. The portion depicted describes the textual and visual features of panel "a" of that article's Fig. 2, which is an x-ray showing a fractured tibia. For simplicity, we depict the cluster words of only a subset of the visual features enumerated in Table 1.

The images comprising multimodal information requests are represented in a similar structure. Fig. 2 shows Topic 7 from the 2011 ImageCLEF medical retrieval track data set. The textual description of the topic ("x-ray images of a tibia with a fracture") is depicted along with two representative images. Visual descriptors are extracted from each representative image and mapped to their corresponding cluster words using the global feature mapping process mentioned previously. Thus, the goal of our multimodal image retrieval system is to combine the textual description of the topic with the cluster words computed for each representative image, and then to use this information to query an index of enriched citations, such as the one shown in Fig. 1.

Fig. 2. The textual description and representative images for Topic 7 from the 2011 ImageCLEF data set as well as the cluster words of some visual descriptors extracted from the images.
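Because Fig. 1 cannot be reproduced here, the sketch below shows, purely for illustration, how the fields of an enriched citation might be organized; the field names, identifiers, and cluster-word strings are assumptions and not the schema actually used by our system.

# A hypothetical enriched-citation record; every name and value is illustrative.
enriched_citation = {
    "title": "Rare clinical experiences for surgical treatment of melanoma ...",
    "abstract": "...",
    "mesh_major": ["..."],
    "mesh_minor": ["..."],
    "panels": [
        {
            "panel": "Fig. 2a",
            "caption": "Sub-caption of panel (a) ...",
            "mentions": ["Snippet of article text that links to this figure ..."],
            "cluster_words": {"CLD": ["CLD_041"], "SCONCEPT": ["SCONCEPT_012"]},
        }
    ],
}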

2.3. Compound figure separation

Figures contained in biomedical articles are often composed of multiple images (commonly labeled "a," "b," "c," etc.). For each multi-panel figure, we segment the figure into its constituent images following the method of Apostolova et al. [19], and we create a separate entry for each panel in the enriched citation of the article containing the figure. For each panel, we include in the enriched citation an additional field representing the panel's "sub-caption," the portion of the figure's caption that specifically refers to the panel. Sub-captions for each panel of a multi-panel figure are extracted using the aforementioned panel segmentation method, which, in addition to segmenting the images, also segments their captions. The caption segmentation method uses regular expressions to locate panel labels such as "(a)" or "(a–c)" in captions, and it then segments the caption into panel-specific sub-captions using the labels as delimiters.
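The caption-splitting step can be sketched with a short regular-expression routine. The pattern and label styles below ("(a)", "(a-c)") are assumptions for the example; the production method of Apostolova et al. [19] combines this kind of text analysis with image-based panel detection and handles a wider variety of label formats.

# A minimal sketch of sub-caption extraction using panel labels as delimiters.
import re

LABEL = re.compile(r"\(([a-z])(?:\s*[-\u2013]\s*([a-z]))?\)", re.IGNORECASE)

def split_subcaptions(caption):
    """Return {panel letter: sub-caption text} for labels such as "(a)" or "(a-c)"."""
    matches = list(LABEL.finditer(caption))
    subcaptions = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(caption)
        text = caption[m.end():end].strip(" .;:")
        first, last = m.group(1).lower(), (m.group(2) or m.group(1)).lower()
        for letter in map(chr, range(ord(first), ord(last) + 1)):
            subcaptions.setdefault(letter, text)
    return subcaptions

print(split_subcaptions("(a) X-ray of the tibia. (b-c) CT and MR images of the same lesion."))
# {'a': 'X-ray of the tibia', 'b': 'CT and MR images ...', 'c': 'CT and MR images ...'}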

2.4. Modality classification

Experiments were performed with both flat and hierarchical modality classification methods. Below we describe our flat classification strategy, an extension of this approach that exploits the hierarchical structure of the classes, and a post-processing method for improving the classification accuracy of illustrations.

2.4.1. Flat classification

Fig. 3 provides an overview of our basic classification approach. Multi-class support vector machines (SVMs) are used as our flat modality classifiers. First, we extract our visual and textual image features from the training images (representing the textual features as term vectors). Then, we perform attribute selection to reduce the dimensionality of the features. The lower-dimensional vectors are constructed independently for each feature type (textual or visual), and the resulting attributes are combined into a single, compound vector. Finally, we use the lower-dimensional feature vectors to train multi-class SVMs for producing textual, visual, or mixed modality predictions.
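A condensed sketch of this flat classification step is given below. It assumes pre-computed term vectors and visual descriptors; the particular attribute-selection filter, the numbers of retained attributes, and the SVM settings are not reported in the text and are therefore illustrative only.

# An illustrative flat modality classifier: per-type attribute selection,
# concatenation into a compound vector, and a multi-class SVM.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

def train_flat_classifier(text_vectors, visual_vectors, labels, k_text=100, k_vis=100):
    text_sel = SelectKBest(f_classif, k=k_text).fit(text_vectors, labels)
    vis_sel = SelectKBest(f_classif, k=k_vis).fit(visual_vectors, labels)
    compound = np.hstack([text_sel.transform(text_vectors),
                          vis_sel.transform(visual_vectors)])
    clf = SVC(kernel="rbf", decision_function_shape="ovr").fit(compound, labels)
    return clf, text_sel, vis_sel

if __name__ == "__main__":
    Xt = np.random.rand(300, 2000)              # textual term vectors
    Xv = np.random.rand(300, 1391)              # combined visual descriptor
    y = np.random.randint(0, 31, size=300)      # 31 modality classes
    model, text_sel, vis_sel = train_flat_classifier(Xt, Xv, y)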

Fig. 3. Classifier organization.

Fig. 4. Revised modality hierarchy.

2.4.2. Hierarchical classification

In contrast to the flat classification strategy described above, we can exploit the hierarchical organization of the modality classes to decompose the task into several smaller classification problems that are applied sequentially. Based on our visual observation of the training samples and our initial experiments, we modified the original modality hierarchy [2] proposed for the task. The hierarchy we used for our experiments is shown in Fig. 4. In our modified hierarchy, we first separate the Illustration meta-class from the other classes. Because compound images can be composed of either illustrations or general images, we include Class COMP under both the Illustration and General meta-classes. Similarly, all non-compound images belong to the Single meta-class under Illustration or General. The general, non-compound images are divided into the Radiology_3D, Photo, and Microscopy meta-classes; the images belonging to each of these categories are acquired by identical or similar imaging technologies or have similar visual features. Flat multi-class SVMs are trained, as shown in Fig. 3, for each meta-class. For recognizing compound images, we utilize the algorithm proposed by Apostolova et al. [19], which detects sub-figure labels and the border of each sub-figure within a compound image. To arrive at a final class label, an image is sequentially classified beginning at the root of the hierarchy until a leaf class can be determined.
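The sequential descent through the hierarchy can be sketched as follows. The node names and the mapping from nodes to trained classifiers are placeholders, since only the meta-class structure of Fig. 4 is described in the text.

# A schematic sketch of sequential, hierarchical modality classification.
def classify_hierarchically(image_features, classifiers, node="root"):
    """Descend the modality hierarchy until a leaf class code is reached.

    `classifiers` maps an internal node name to a trained multi-class model
    whose predictions are either child node names or leaf class codes.
    """
    while node in classifiers:                      # internal node: keep descending
        node = classifiers[node].predict([image_features])[0]
    return node                                     # leaf modality code, e.g., "DRXR"

# Hypothetical usage: classifiers = {"root": illustration_vs_general_model,
#                                    "General": general_meta_model,
#                                    "Radiology_3D": radiology_model, ...}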

2.4.3. Illustration post-processing

Because our initial classification experiments resulted in only modest accuracy for the fourteen "Illustration" classes shown in Fig. 4, we concluded that our current textual and visual features may not be sufficient for representing these figures. Therefore, in addition to the aforementioned machine learning modality classification methods, we also developed several complementary rule-based strategies [20] for increasing the classification accuracy of the "Illustration" classes.

A majority of the training samples contained in the "Illustration" meta-class, unlike other images in the collection, consist of line drawings or text superimposed on a white background. Because of this regularity, we can apply rules built on knowledge of such class-specific features to post-process our initial classification results. For example, program listings consist mostly of text; thus, the use of text and line detection methods may increase the classification accuracy of Class GPLI. Similarly, polygons (e.g., rectangles, hexagons, etc.) contained in flowcharts (GFLO), tables (GTAB), system overviews (GSYS), and chemical structures (GCHE) are a distinctive feature of these modalities. The methods of Jung et al. [21] and OpenCV (http://opencv.willowgarage.com/wiki/) functions are used to assess the presence of text and polygons, respectively.

2.5. Image retrieval

In this section we describe our textual, visual, and mixed approaches to the ad-hoc image retrieval task. Descriptions of the submitted runs that utilize these methods are presented in Section 2.6.

2.5.1. Textual approaches

To allow for efficient retrieval, we index our enriched citations with the Essie [22] biomedical information retrieval system. Essie is a probabilistic retrieval system for structured documents developed by the NLM to support ClinicalTrials.gov [23], an online registry of clinical research studies. Being specifically designed for biomedical text, Essie automatically expands query terms along the synonymy relationships in the Unified Medical Language System® (UMLS®) [24] Metathesaurus®. Essie, as well as how we use previous ImageCLEF collections to optimize its parameters, is described in more detail in Sections 2.5.4–2.5.6.

Each topic description is organized into a frame-based representation (e.g., PICO, a mnemonic for structuring clinical questions in evidence-based practice that represents Patient/Population/Problem, Intervention, Comparison, and Outcome) following a method similar to that described by Demner-Fushman and Lin [25]. Extractors identify concepts related to problems, interventions, age, anatomy, drugs,
and modality. Also identified are modifiers of the extracted concepts and a limited number of relationships among them. The extracted concepts are then transformed into Essie queries. To construct a query for each topic, we create and combine several Boolean expressions derived from the extracted concepts. First, we create an expression by combining the concepts using the AND operator (meaning all of the concepts are required to occur in an image's enriched citation), and then we produce additional expressions by allowing an increasing number of the extracted concepts to be optional. Additionally, in many of our retrieval runs we also include a bag-of-words representation of the topic description as a component of a query. All of these expressions are then combined using the OR operator. Our retrieval system ranks images retrieved using a more specific expression (e.g., an expression with no optional concepts) higher than those images retrieved using a less specific expression (e.g., the bag-of-words representation of the topic description). The resulting queries are used to search the Essie indices.
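The sketch below illustrates this query-construction strategy in simplified form: a strict conjunction of all extracted concepts, progressively relaxed versions with more optional concepts, and a bag-of-words fallback, all joined by OR. The operator syntax is generic pseudo-syntax rather than Essie's actual query language, and the relaxation limit is an assumption made for the example.

# An illustrative construction of progressively relaxed Boolean queries.
from itertools import combinations

def build_query(concepts, topic_description, max_optional=2):
    expressions = [" AND ".join(concepts)]                      # most specific first
    for n_optional in range(1, min(max_optional, len(concepts)) + 1):
        for required in combinations(concepts, len(concepts) - n_optional):
            expressions.append(" AND ".join(required))
    expressions.append(topic_description)                       # bag-of-words fallback
    return " OR ".join("(%s)" % e for e in expressions)

print(build_query(["x-ray", "tibia", "fracture"],
                  "x-ray images of a tibia with a fracture"))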
2.5.2. Visual approaches

Our visual approaches to image retrieval are based on retrieving images that appear visually similar to the given topic images. For our basic method, we compute the visual similarity between two images as the Euclidean distance between their visual descriptors. For the purposes of computing this distance, we represent each image as a combined feature vector composed of a subset of the visual descriptors listed in Table 1 after attribute selection.

In order to avoid the online computation of the above similarity metric, for some of our submitted visual retrieval runs we utilize Essie to retrieve images based on the cluster words stored in their enriched citations. Retrieval is performed by first extracting a query image's visual features, then determining the features' cluster membership, and finally combining the unique "words" assigned to the clusters containing the features in order to form a textual query. For a given topic, we combine the textual interpretations of all features for all sample images using the OR operator.

2.5.3. Mixed approaches

We explore several methods of combining our textual and visual approaches. One such approach involves the use of our image cluster words. For performing multimodal retrieval using cluster words, we first extract the visual descriptors listed in Table 1 from each example image of a given topic. The clusters to which the extracted descriptors are nearest are then located in order to determine their corresponding cluster words. Finally, we combine these cluster words with words taken from the topic description to form a multimodal query appropriate for Essie.

While the use of cluster words allows us to create multimodal queries, we can instead directly combine the independent outputs of our textual and visual approaches. In a score merging approach, we apply a min-max normalization to the ranked lists of scores produced by our textual and visual retrieval strategies. The normalized scores given to each image are then linearly combined to produce a final ranking. The weights are based on the mean average precision of our best performing visual and textual approaches from the 2012 ImageCLEF evaluation. Similarly, a rank merging approach combines the results of our textual and visual approaches using the ranks of the retrieved images instead of their normalized scores. To produce the final image ranking using this strategy, we re-score each retrieved image as the reciprocal of its rank and then repeat the above procedure for combining scores.
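The two merging strategies can be sketched compactly as below, assuming each run is represented as a dictionary mapping image identifiers to retrieval scores; the textual and visual weights shown are placeholders for the MAP-derived weights described above.

# Illustrative score merging (min-max normalization + linear combination) and
# rank merging (reciprocal-rank re-scoring followed by the same combination).
def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in scores.items()}

def merge_scores(text_run, visual_run, w_text=0.8, w_visual=0.2):
    t, v = minmax(text_run), minmax(visual_run)
    ids = set(t) | set(v)
    fused = {i: w_text * t.get(i, 0.0) + w_visual * v.get(i, 0.0) for i in ids}
    return sorted(fused, key=fused.get, reverse=True)            # ranked image IDs

def merge_ranks(text_run, visual_run, **weights):
    # Re-score each image as the reciprocal of its rank, then reuse merge_scores.
    to_rr = lambda run: {img: 1.0 / (rank + 1)
                         for rank, img in enumerate(sorted(run, key=run.get, reverse=True))}
    return merge_scores(to_rr(text_run), to_rr(visual_run), **weights)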

or visual methods, we can eliminate retrieved images that are not of the same modality as the topic images. An advantage of performing hierarchical classification is that we can filter the retrieved results using the meta-classes within the hierarchy (e.g., "Radiology").

Finally, for some retrieval submissions, we combine the retrieval results produced by several queries into a single ranked list of images. This query combination, or padding, is performed by simply appending the ranked list of images retrieved by a subsequent query to the end of the ranked list produced by the preceding query. The goal of this approach is to improve retrieval recall under the assumption that all the images retrieved by one query are more relevant than those retrieved by subsequent queries.

2.5.4. The Essie biomedical information retrieval system

The Essie information retrieval system is utilized for efficiently retrieving enriched citations for our textual, visual, and mixed retrieval approaches. Essie scores our enriched citations according to a variation of the query likelihood model [26]. Thus, Essie ranks each document by the probability of the document given a specific query. A "document" to Essie is a set of fields, or search areas, comprising the textual and visual features described previously. Essie assumes term occurrences in one field are independent of term occurrences in other fields, although the occurrences may not be mutually exclusive (i.e., the same term can occur simultaneously in multiple fields). If we let D = \bigcup_{i=1}^{n} F_i be a document composed of n fields, then the score S_d that Essie assigns to document D for a query Q is defined by:

S_d(D, Q) = P(D \mid Q)    (1)

S_d(D, Q) = P\left( \bigcup_{i=1}^{n} F_i \mid Q \right)    (2)

S_d(D, Q) = \sum_{k=1}^{n} (-1)^{k-1} \sum_{I \subseteq \{1, \ldots, n\},\ |I| = k} P\left( \bigcap_{i \in I} F_i \mid Q \right)    (3)

S_d(D, Q) = \sum_{k=1}^{n} (-1)^{k-1} \sum_{I \subseteq \{1, \ldots, n\},\ |I| = k} \prod_{i \in I} w_i S_f(F_i, Q)    (4)
where 0 ≤ w_i, S_f ≤ 1 and n = |D|. Eq. (1) in the above derivation is the assumption of the query likelihood model: documents are scored according to their probability given the query. Eq. (2) results from the definition of D as a set of fields, and Eq. (3) is the principle of inclusion and exclusion applied to probabilities. Eq. (4) results from the field independence assumption mentioned above, with P(F_i \mid Q) replaced by w_i S_f(F_i, Q), the probability of field F_i given the query Q weighted by the constant w_i. Thus, given a score S_f for each field, the document score depends only on the field weights. Although we do not address the computation of field scores in our current work, the score Essie assigns to field F_i is obtained by setting w_i = 1 and computing retrieval results for documents consisting only of F_i. In such a scenario, the document and field scores are equal.
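Under the independence assumption, the inclusion-exclusion sum of Eq. (4) collapses to the complement of a product, S_d(D, Q) = 1 - \prod_i (1 - w_i S_f(F_i, Q)), which is the form computed in the sketch below; field scores and weights are assumed to be supplied as parallel lists.

# An illustrative computation of the Essie document score of Eq. (4).
def essie_document_score(field_scores, weights):
    """S_d(D, Q) = 1 - prod_i (1 - w_i * S_f(F_i, Q)), with 0 <= w_i, S_f <= 1."""
    prob_no_match = 1.0
    for s_f, w in zip(field_scores, weights):
        prob_no_match *= 1.0 - w * s_f
    return 1.0 - prob_no_match

# Example: a caption field scoring 0.6 with weight 1.0 and a title field
# scoring 0.3 with weight 0.5 yields 1 - (1 - 0.6)(1 - 0.15) = 0.66.
print(essie_document_score([0.6, 0.3], [1.0, 0.5]))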

2.5.5. Gradient-based optimization of Essie field weights

For a given retrieval task, we can affect the ranking of documents by manually adjusting Essie's field weights. For example, for text-based image retrieval, we might assume image captions to be an important feature and weight the caption fields of our enriched citations more heavily than the other textual fields. Similarly, for content-based retrieval, we might assume color to be an important feature and weight the CLD fields more heavily. However, in general we do not know the optimal allocation of weights for either of these scenarios, or how the textual and visual features should be weighted relative to one another for multimodal queries. It would be desirable to optimize the weights we use for text-based, content-based, and multimodal retrieval tasks, but, given the number of fields in our enriched citations, directly optimizing their weights is computationally infeasible using brute-force methods.

Gradient-based optimization strategies are more desirable than brute-force methods, but their application within information retrieval has several well-known limitations. First, gradient-based methods are not guaranteed to find a global optimum. However, this drawback can be overcome by repeating the optimization with different initializations and choosing the best solution. Second, the function being optimized must be differentiable. Unfortunately, commonly used information retrieval metrics, such as mean average precision, are not smooth functions and cannot be optimized directly. Instead, surrogate functions that approximate the desired retrieval metric must be used, and this topic has been the subject of much research. Previous experiments [5] have shown the RankNet [27] cross entropy cost function to be a reasonable and computationally efficient surrogate for common retrieval metrics, and we utilize this function in our current work. Below we briefly describe the function, following closely the description of RankNet by Metzler [28].

The RankNet cost function is computed over pairwise document preferences for a set of training queries. The set of preferences for a training query Q is given by R_Q, where (D_1, D_2) \in R_Q implies that document D_1 should be ranked higher than document D_2. R_Q can be easily constructed for each query in a set of relevance judgements produced as a result of community-wide information retrieval evaluations such as ImageCLEF. Given all the pairwise preferences for a set of queries \mathcal{Q}, the RankNet cost function is defined as:

C(\mathcal{Q}, R) = \sum_{Q \in \mathcal{Q}} \sum_{(D_1, D_2) \in R_Q} \ln(1 + e^{Y})    (5)

where Y = S_d(D_2, Q) - S_d(D_1, Q). Thus, the cost C is minimized by maximizing the difference in document scores that Essie assigns to relevant and non-relevant documents. In order to minimize C using gradient-based methods, we must compute the partial derivative of the cost function with respect to the field weights. For a weight w_j, the partial derivative of the cost function is computed as:

\frac{\partial C}{\partial w_j} = \sum_{Q \in \mathcal{Q}} \sum_{(D_1, D_2) \in R_Q} \frac{\partial C}{\partial Y} \frac{\partial Y}{\partial w_j}    (6)

where the above follows from the application of the chain rule for derivatives. \partial C / \partial Y is computed as:

\frac{\partial C}{\partial Y} = \frac{e^{Y}}{1 + e^{Y}}    (7)

Finally, by differentiating the Essie document score with respect to the field weight, we compute \partial Y / \partial w_j as:

\frac{\partial Y}{\partial w_j} = \frac{\partial S_d(D_2, Q)}{\partial w_j} - \frac{\partial S_d(D_1, Q)}{\partial w_j}    (8)

\frac{\partial Y}{\partial w_j} = S_f(F_j, Q) \left[ S_d(D_1 \setminus F_j, Q) - S_d(D_2 \setminus F_j, Q) \right]    (9)

For a document D, the derivative of the document score with respect to weight w_j evaluates to the field score for F_j multiplied by one minus the document score of D absent field F_j. Because each w_j is bounded by 0 ≤ w_j ≤ 1, our gradient-based optimization method must respect these constraints. For this work, we minimize the RankNet cost function using the bound-constrained, limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm (L-BFGS-B) [29].
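A condensed sketch of the optimization loop is given below. It assumes the document score of Eq. (4), training preferences supplied as pairs of field-score vectors (the preferred document first), and SciPy's bound-constrained L-BFGS-B; the initialization and data format are assumptions made for the example.

# An illustrative minimization of the RankNet cost (Eqs. (5)-(9)) with L-BFGS-B.
import numpy as np
from scipy.optimize import minimize

def doc_score(field_scores, w):
    return 1.0 - np.prod(1.0 - w * field_scores)

def doc_score_grad(field_scores, w):
    # d S_d / d w_j = S_f(F_j, Q) * prod_{i != j} (1 - w_i S_f(F_i, Q))
    full = 1.0 - w * field_scores
    return np.array([field_scores[j] * np.prod(np.delete(full, j)) for j in range(len(w))])

def ranknet_cost_and_grad(w, preference_pairs):
    cost, grad = 0.0, np.zeros_like(w)
    for d1, d2 in preference_pairs:                  # D1 should rank above D2
        y = doc_score(d2, w) - doc_score(d1, w)      # Y of Eq. (5)
        cost += np.log1p(np.exp(y))
        dC_dY = np.exp(y) / (1.0 + np.exp(y))        # Eq. (7)
        grad += dC_dY * (doc_score_grad(d2, w) - doc_score_grad(d1, w))   # Eqs. (6), (8), (9)
    return cost, grad

def optimize_weights(preference_pairs, n_fields, seed=0):
    w0 = np.random.default_rng(seed).uniform(0.1, 0.9, n_fields)
    result = minimize(ranknet_cost_and_grad, w0, args=(preference_pairs,),
                      jac=True, method="L-BFGS-B", bounds=[(0.0, 1.0)] * n_fields)
    return result.x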

2.5.6. Evaluation of optimized weights

Before performing our image retrieval experiments for the 2013 ImageCLEF evaluation, we validated the effectiveness of the above optimization strategy on the 2011 and 2012 ImageCLEF medical retrieval track data sets. The 2011 collection was used for optimizing Essie's field weights and the 2012 collection for measuring any improvement in retrieval performance. The 2011 and 2012 collections are nearly identical, containing roughly 300,000 images taken from a portion of the articles in the open access subset of PubMed Central®. The only substantial difference between the two collections is the set of retrieval topics that were judged as part of each evaluation. There were 30 multimodal topics in the 2011 collection, including the one shown in Fig. 2, and 22 topics in the 2012 collection. For each collection, we produced a set of enriched citations, such as the one shown in Fig. 1. Essie was then used to index each of the collections.

Essie's field weights were optimized for various content-based, text-based, and multimodal retrieval strategies. Our content-based methods utilize the textual signatures of five visual descriptors from Table 1 (CEDD, CLD, EHD, FCTH, and SCONCEPT) that we extract from query images and use to search the visual fields of our enriched citations. Similarly, our text-based methods only use the textual description of a topic to search over the textual fields. Finally, our multimodal retrieval strategies combine the textual description of a topic with the visual features of its representative images to search all fields of our enriched citations. After optimizing the field weights for each of the above runs on the 2011 ImageCLEF topics, we validated Essie's performance using the same weights on the 2012 topics. The optimized weights were then used for performing our image retrieval experiments on the 2013 ImageCLEF collection.

2.6. Submitted runs

In this section we describe each of our submitted runs for the modality classification, compound figure separation, and image retrieval tasks. Each run is identified by its official file name or trec_eval run ID and its mode (textual, visual, or mixed). All submitted runs are automatic, meaning they are performed in batch without any human intervention.

2.6.1. Modality classification runs

M1. nlm textual only flat (textual): A flat multi-class SVM classification using selected attributes from a combined term vector created from four textual features (article title, MeSH terms, and image caption and mention).
M2. nlm visual only hierarchy (visual): A hierarchical multi-class classification using selected attributes from a combined visual descriptor of features 1–15 of Table 1.
M3. nlm mixed hierarchy (mixed): A hierarchical multi-class SVM classification combining Runs M1 and M2. Textual and visual features are combined into a single feature vector for each image.
M4. nlm mixed using 2012 visual classification (mixed): A combination of Runs M1 and M2 but using models trained on the 2012 ImageCLEF medical modality classification training set. Images are first classified according to Run M1. Images having no textual features are classified according to Run M2. Our compound figure separation method is used as a pre-classifier to identify all multi-panel images.
M5. nlm mixed using 2013 visual classification 1 (mixed): Like Run M4, but using the 2013 ImageCLEF medical modality classification data set.
M6. nlm mixed using 2013 visual classification 2 (mixed): Like Run M5, but using all visual features from Table 1.

2.6.2. Compound figure separation runs

S1. nlm multipanel separation (mixed): A combination of figure caption analysis, panel border detection, and panel label recognition.

2.6.3. Image retrieval runs

A1. nlm-image-based-textual (textual): A combination of two queries using Essie. (A1.Q1) A disjunction of modality terms extracted from the query topic must occur within the caption or mention fields of an image's textual features; a disjunction of the remaining terms is allowed to occur in any field. (A1.Q2) A lossy expansion of the verbatim topic is allowed to occur in any field.
A2. nlm-image-based-visual (visual): A disjunction of the query images' clustered visual descriptors must occur within the global image feature field.
A3. image latefusion merge (visual): An automatic content-based image retrieval approach. In this approach, features 10–16 of Table 1 are used, and their individual similarity scores are linearly combined with predefined weights. The weights are normalized based on the cross-validation accuracy they each independently achieve in constructing our modality classifier. All images in each topic are considered, and the result lists for each topic are combined to produce a single list of retrieved images.
A4. image latefusion merge filter (visual): Like Run A3, but the search is performed after removing from the collection all images classified into a modality class other than those represented by the query images.
A5. latefusion accuracy merge (visual): Like Run A3, but feature weights are based on their normalized accuracy in classifying images in the 2012 ImageCLEF medical modality classification test set.
A6. nlm-image-based-mixed (mixed): A combination of Queries A1.Q1–Q2 with Run A2.
A7. Txt Img Weighted Merge (mixed): A score-based combination of Runs A1 and A4.
A8. Merge RankToScore weighted (mixed): A rank-based combination of Runs A1 and A4.
A9. Txt Img Weighted Merge A (mixed): A score-based combination of Runs A1 and A5.
A10. Merge RankToScore weighted A (mixed): A rank-based combination of Runs A1 and A5.

3. Results

Tables 2–4 summarize the results of our modality classification, compound figure separation, and image retrieval runs. In Tables 2 and 3, we give the accuracy of our figure classification and separation methods, respectively. In Table 4, we give the mean average precision (MAP), binary preference (bpref), and precision-at-ten (P@10) of our retrieval methods. A "*" in Tables 2–4 indicates the best performing textual, visual, and mixed runs submitted as part of the 2013 ImageCLEF evaluation. The results of other participants are included at the end of each table if our runs did not achieve the best results.

Table 2
Accuracy results for the modality classification task.

ID                                                Mode      Accuracy (%)
nlm mixed using 2013 visual classification 2      Mixed     69.28
nlm mixed using 2013 visual classification 1      Mixed     68.74
nlm mixed hierarchy                               Mixed     67.31
nlm mixed using 2012 visual classification        Mixed     67.07
nlm visual only hierarchy                         Visual    61.50
nlm textual only flat                             Textual   51.23
IBM modality run8                                 Mixed     81.68*
IBM modality run4                                 Visual    80.79*
IBM modality run1                                 Textual   64.17*

Table 3
Accuracy results for the compound figure separation task.

ID                                                  Mode      Accuracy (%)
nlm multipanel separation                           Mixed     69.27*
ImageCLEF2013 CompoundFigureSeparation HESSO CFS    Visual    84.64*

Table 4
Retrieval results for the ad-hoc image retrieval task.

ID                              Mode      MAP       bpref     P@10
nlm-se-image-based-mixed        Mixed     0.3196*   0.2983    0.3886
nlm-se-image-based-textual      Textual   0.3196*   0.2982    0.3886
Txt Img Weighted Merge A        Mixed     0.3124    0.3014    0.3886
Merge RankToScore weighted A    Mixed     0.3120    0.2950    0.3771
Txt Img Weighted Merge          Mixed     0.3086    0.2938    0.3857
Merge RankToScore weighted      Mixed     0.3032    0.2872    0.3943
image latefusion merge          Visual    0.0110    0.0207    0.0257
image latefusion merge filter   Visual    0.0101    0.0244    0.0343
latefusion accuracy merge       Visual    0.0092    0.0179    0.0314
nlm-se-image-based-visual       Visual    0.0002    0.0021    0.0029
DEMIR4                          Visual    0.0185*   0.0361    0.0629

Figs. 5–7 summarize the results of our field weight optimization experiments. Fig. 5 presents the results of our content-based retrieval runs, Fig. 6 presents the results of our text-based runs, and Fig. 7 summarizes our one multimodal run. For each retrieval run, we include a graph showing the optimized field weights learned on the 2011 ImageCLEF collection. Also included in each figure is a table of results obtained for the 2011 and 2012 collections comparing retrieval performance using the depicted, optimized weights with retrieval performance using uniform, unoptimized weights. Essie's average retrieval performance is reported as binary preference and judged mean average precision (MAP′), retrieval metrics known to be robust against incomplete judgements [30]. Statistical significance between runs having unoptimized weights and runs having optimized weights was measured using Fisher's two-sided, paired randomization test [31], which is a recommended statistical test for evaluating information retrieval systems. A "*" in the tables indicates a statistically significant (p < 0.05) improvement in retrieval performance using the optimized weights. Note that results for the 2011 collection are expected to demonstrate greater performance improvements because the field weights were trained and evaluated on the same set of topics.

Fig. 5. Optimized field weights and validation results for our content-based image retrieval runs using Essie.

Fig. 6. Optimized field weights and validation results for our text-based image retrieval runs using Essie.

4. Discussion

Our results demonstrate that our methods are effective for performing literature-based biomedical image classification and retrieval tasks. Below we discuss the results we obtained for the modality classification, compound figure separation, and image retrieval tasks of the medical track of ImageCLEF 2013. Also discussed is the effectiveness of our field weight optimization strategy in improving retrieval performance.

4.1. Modality classification

For the modality classification task, our best submission was nlm mixed using 2013 visual classification 2, a mixed approach that achieved a classification accuracy of 69.28% and was ranked within the submissions of the top five participating groups in the modality classification task. The effectiveness of this approach reveals that hierarchical classification generally performs better than the flat classification approach, and it demonstrates the utility of combining the results of independent textual and visual modality classifiers. In our two best-performing methods, the images were classified using textual features, and visual features were only considered for images with no associated text.

4.2. Compound figure separation

For the compound figure separation task, nlm multipanel separation, our mixed approach, resulted in an accuracy of 69.27% and was ranked second among four submissions from three groups participating in this task. Although there are few methods with which to compare our approach, these results show that our current mixed strategy for separating compound figures is reasonably effective and competitive with other state-of-the-art methods.

Compound figure separation can be thought of as the first stage of a pipeline of image processing tasks: images are first separated into individual panels, the panels are then classified by image modality, and finally, the classified panels are prepared for indexing and retrieval by structuring their textual and visual features into enriched citations. Because the effectiveness of our classification and retrieval techniques depends on correctly segmenting multi-panel figures, we hope to improve our compound figure separation accuracy in future work.

Fig. 7. Optimized field weights and validation results for our mixed image retrieval runs using Essie.

4.3. Image retrieval

For the image retrieval task, our best submission was nlm-se-image-based-mixed, a mixed approach that achieved a mean average precision of 0.3196 and was ranked first among all participating groups. Our second-best image retrieval approach, nlm-se-image-based-textual, achieved the same mean average precision of 0.3196 but a slightly lower (statistically insignificant) binary preference score.

As our past experience would indicate, our worst performing submissions were our visual approaches. We suspect this is due to the inability of our visual descriptors to adequately represent the semantics of the images. Of particular note is nlm-se-image-based-visual, which achieved the lowest performance of all our visual methods. The degraded performance is perhaps due to the fact that, being based on a clustering of visual descriptors, this method is only able to approximate the visual similarity between images, whereas the other visual methods compute visual similarity directly.

Although statistically indistinguishable from our textual approach, the fact that our highest ranked submission was a mixed approach is an encouraging result, and it provides evidence that our ongoing efforts at integrating textual and visual information will be successful. In particular, the representation of our visual features as cluster words, which are indexed and retrieved using a traditional text-based information retrieval system, is an effective way not only of incorporating visual information with text, but also of avoiding the computational expense common among content-based retrieval methods. Furthermore, our other mixed approaches demonstrate the effectiveness of rank and score merging in improving upon our visual retrieval results. While we did not directly evaluate the importance of using the UMLS synonymy for term expansion, we suspect its contribution is significant since it is central to how Essie functions. Some of our other mixed runs, in utilizing the results of our modality classifiers, may have been weakened by the modest performance of our classification methods.

The effectiveness of Essie in performing many of our image retrieval submissions is partially due to the weights it gives to the fields of our enriched citations. The results of our field optimization evaluation demonstrate that a gradient-based optimization method can significantly improve Essie's performance. Our results also reveal the textual and visual features that are most important for particular retrieval tasks. Several generalizations can be made from these results. First, our content-based retrieval runs seem to benefit the most from field weight optimization, whereas the performance of our text-based and multimodal runs remains largely unaffected. Second, the optimized field weights seem to be topic-specific, and improvements on the 2011 ImageCLEF collection do not necessarily yield similar improvements on the 2012 collection. Although field weight optimization generally results in better performance, we often did not find the differences to be statistically significant. The inability of gradient-based methods to directly optimize our retrieval metrics, instead relying on a surrogate performance measure such as the RankNet cost function, may have contributed to this limitation. The results of our content-based, text-based, and multimodal retrieval runs are discussed in more detail below.
For content-based retrieval, our optimization strategy demonstrated statistically significant improvements in both binary preference and mean average precision on the 2011 collection. Optimization of the field weights resulted in an allocation of weight that favored CLD and SCONCEPT, which corresponds well with our expectation that color and texture features would be the most significant.

For text-based retrieval, we found a statistically significant improvement in binary preference on the 2012 ImageCLEF collection. However, optimization resulted in little variation among the textual field weights. Somewhat surprisingly, we found the MeSH and title fields to be the least effective for text-only retrieval. It is suspected that the increased weight of the MeSH fields might have obviated the need for the title field, as an article's title and its assigned MeSH terms may be highly correlated.

Finally, for multimodal retrieval, our optimization strategy achieved better average retrieval performance than both our text-based and content-based runs. A statistically significant improvement in mean average precision was observed on the 2012 collection, but, similar to our text-based run, we did not observe this difference in performance on the 2011 collection. Weight optimization resulted in an allocation of weight to our textual fields similar to what we found for our text-based run, with the MeSH fields again receiving the least amount of weight. However, unlike our content-based run, the allocation of weight to the visual fields used by our multimodal run was uniform. This result was surprising, as we expected the visual features to be much less significant than the textual features. Given the strength of our combined text-based run, it is likely that adjusting the visual field weights makes little difference in the performance of our multimodal run. Such an observation could result from Essie's textual field scores being much greater than its visual field scores, and this possibility is worth further study.

5. Conclusion

This work describes our participation in the compound figure separation, modality classification, and image retrieval tasks of the 2013 medical track of ImageCLEF. In doing so, we first introduced the textual and visual features we use to represent images and how we organize them into structured representations known as enriched citations. Our methods for each of the three literature-based image informatics tasks were then described. For the image retrieval task, we described the Essie information retrieval system, how it scores and ranks our enriched citations, and a gradient-based strategy for optimizing the feature weights used in its scoring function. Finally, we described the runs we submitted as part of the ImageCLEF evaluation and presented their results.

Our results demonstrated the effectiveness of our methods. For the modality classification task, our best submission achieved a classification accuracy of 69.28%. Our submission for the compound figure separation task achieved an accuracy of 69.27%, and our best submission for the image retrieval task achieved a mean average precision of 0.3196. Our image retrieval approach was ranked first among all the submissions from the groups participating in the ImageCLEF evaluation. In performing the retrieval task, we used our gradient-based optimization strategy to determine ideal feature weights for our text-based, content-based, and mixed retrieval runs. Using the optimized weights, we observed statistically significant improvements in retrieval performance. This result demonstrates that the optimization of textual and visual feature weights is an effective means of improving the retrieval of images from the biomedical literature. Finally, in each of the above tasks, we obtained our best results using mixed approaches, which indicates the importance of both textual and visual features for literature-based biomedical image classification and retrieval.

Acknowledgments

This work is supported by the intramural research program of the U.S. National Library of Medicine, National Institutes of Health, and by appointments to the U.S. National Library of Medicine Research Participation Program administered by the Oak Ridge Institute for Science and Education.

References

[1] Datta R, Joshi D, Li J, Wang JZ. Image retrieval: ideas, influences, and trends of the new age. ACM Comput Surv 2008;40(2):5:1–60.
[2] de Herrera AGS, Kalpathy-Cramer J, Demner-Fushman D, Antani S, Müller H. Overview of the ImageCLEF 2013 medical tasks. In: Working notes of CLEF 2013. 2013. p. 1–15.
[3] Müller H, Clough P, Deselaers T, Caputo B, editors. ImageCLEF: experimental evaluation in visual information retrieval; vol. 32 of the information retrieval series. Heidelberg: Springer; 2010.
[4] Fujita S. Retrieval parameter optimization using genetic algorithms. Inf Process Manage 2009;45(6):664–82.
[5] Taylor M, Zaragoza H, Craswell N, Robertson S, Burges C. Optimization methods for ranking functions with multiple parameters. In: Proceedings of the 15th ACM international conference on information and knowledge management. 2006. p. 585–93.
[6] Chapelle O, Wu M. Gradient descent optimization of smoothed information retrieval metrics. Inf Retr 2010;13(3):216–35.
[7] Chatzichristofis SA, Boutalis YS. FCTH: fuzzy color and texture histogram – a low level feature for accurate image retrieval. In: Proceedings of the 9th international workshop on image analysis for multimedia interactive services. 2008. p. 191–6.
[8] Srinivasan GN, Shobha G. Statistical texture analysis. In: Proceedings of World Academy of Science, Engineering and Technology; vol. 36. 2008. p. 1264–9.
[9] Mäenpää T. The local binary pattern approach to texture analysis – extensions and applications (Ph.D. thesis). University of Oulu; 2003.
[10] Lowe D. Object recognition from local scale-invariant features. In: Proceedings of the seventh IEEE international conference on computer vision; vol. 2. 1999. p. 1150–7.
[11] Tamura H, Mori S, Yamawaki T. Textural features corresponding to visual perception. IEEE Trans Syst Man Cybern 1978;8(6):460–73.
[12] Chang SF, Sikora T, Puri A. Overview of the MPEG-7 standard. IEEE Trans Circuits Syst Video Technol 2001;11(6):688–95.
[13] Chatzichristofis SA, Boutalis YS. CEDD: color and edge directivity descriptor – a compact descriptor for image indexing and retrieval. In: Gasteratos A, Vincze M, Tsotsos JK, editors. Proceedings of the 6th international conference on computer vision systems; vol. 5008 of lecture notes in computer science. Berlin: Springer; 2008. p. 312–22.
[14] Rahman MM, Antani S, Thoma G. A medical image retrieval framework in correlation enhanced visual concept feature space. In: Proceedings of the 22nd IEEE international symposium on computer-based medical systems. 2009. p. 1–4.
[15] Lux M, Chatzichristofis SA. LIRe: Lucene image retrieval – an extensible Java CBIR library. In: Proceedings of the 16th ACM international conference on multimedia. 2008. p. 1085–8.

[16] Simpson MS, Demner-Fushman D, Antani SK, Thoma GR. Multimodal biomedical image indexing and retrieval using descriptive text and global feature mapping. Inf Retr 2013, http://dx.doi.org/10.1007/s10791-013-9235-2.
[17] Arthur D, Vassilvitskii S. k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, SODA'07. 2007. p. 1027–35.
[18] Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. SIGKDD Explorations 2009;11(1).
[19] Apostolova E, You D, Xue Z, Antani S, Demner-Fushman D, Thoma GR. Image retrieval from scientific publications: text and image content processing to separate multi-panel figures. J Am Soc Inf Sci Technol 2013;64(5):893–908.
[20] You D, Rahman MM, Antani S, Demner-Fushman D, Thoma GR. Text- and content-based biomedical image modality classification. In: Proceedings of SPIE; vol. 8674 of medical imaging 2013: advanced PACS-based imaging informatics and therapeutic applications. 2013. p. 86740L-1–86740L-8.
[21] Jung K, Kim KI, Jain AK. Text information extraction in images and video: a survey. Pattern Recogn 2004;37(5):977–97.
[22] Ide NC, Loane RF, Demner-Fushman D. Essie: a concept-based search engine for structured biomedical text. J Am Med Inform Assoc 2007;1(3):253–63.
[23] McCray AT, Ide NC. Design and implementation of a national clinical trials registry. J Am Med Inform Assoc 2000;7(3):313–23.
[24] Lindberg D, Humphreys B, McCray A. The Unified Medical Language System. Methods Inf Med 1993;32(4):281–91.
[25] Demner-Fushman D, Lin J. Answering clinical questions with knowledge-based and statistical techniques. Comput Linguist 2007;33(1):63–103.
[26] Manning CD, Raghavan P, Schütze H. An introduction to information retrieval. New York: Cambridge University Press; 2009.
[27] Burges C, Shaked T, Renshaw E, Lazier A, Deeds M, Hamilton N, et al. Learning to rank using gradient descent. In: Proceedings of the 22nd international conference on machine learning. 2005. p. 89–96.
[28] Metzler D. Using gradient descent to optimize language modeling smoothing parameters. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval. 2007. p. 687–8.
[29] Byrd R, Lu P, Nocedal J, Zhu C. A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput 1995;16(5):1190–208.
[30] Sakai T. Alternatives to bpref. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval. 2007. p. 71–8.
[31] Smucker MD, Allan J, Carterette B. A comparison of statistical significance tests for information retrieval evaluation. In: Proceedings of the sixteenth ACM conference on information and knowledge management. 2007. p. 623–32.
[32] Huang KY, Wang CR, Yang RS. Rare clinical experiences for surgical treatment of melanoma with osseous metastases in Taiwan. BMC Musculoskelet Disord 2007;8(70).
