
Image Search Re-ranking with Query-dependent Click-based Relevance Feedback

Yongdong Zhang, Senior Member, IEEE, Xiaopeng Yang, and Tao Mei*, Senior Member, IEEE

Abstract—Our goal is to boost text-based image search results via image re-ranking. There are diverse modalities (features) of images that we can leverage for re-ranking; however, the effects of different modalities are query-dependent. The primary challenge we face is how to fuse multiple modalities adaptively for different queries, which has often been overlooked in previous re-ranking research. Moreover, multi-modality fusion without an understanding of the query is risky, and may lead to incorrect judgments in re-ranking. Therefore, to obtain the best fusion weights for the query, in this paper we leverage click-through data, which can be viewed as "implicit" user feedback and an effective means of understanding the query. A novel re-ranking algorithm, called click-based relevance feedback, is proposed. This algorithm emphasizes the successful use of click-through data for identifying user search intention, while leveraging a multiple kernel learning algorithm to adaptively learn the query-dependent fusion weights for multiple modalities. We conduct experiments on a real-world dataset collected from a commercial search engine with click-through data. Encouraging experimental results demonstrate that our proposed re-ranking approach can significantly improve the NDCG@10 of the initial search results by 11.62%, and can outperform several existing approaches for most kinds of queries, such as tail, middle and top queries.

Index Terms—Image search, search re-ranking, click-based relevance feedback, multiple kernel learning.

I. INTRODUCTION

As image websites such as Flickr¹ and Pinterest² develop rapidly, a satisfying image retrieval system is imperative to such systems' continuous improvement of the user experience. Nowadays, many techniques have been developed for multimedia search [30][40]. Due to the success of information retrieval, most commercial search engines still employ text-based search techniques for image search by using surrounding textual information. As the text information is sometimes noisy and even unavailable, the drawback of such a retrieval method is that it cannot describe the contents of images precisely, thus hampering the performance of image search.

*Corresponding author: T. Mei. Y. Zhang and X. Yang are with the Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, P. R. China (email: [email protected]; [email protected]). T. Mei is with Microsoft Research, Beijing 100080, P. R. China (email: [email protected]).
1 http://www.flickr.com/
2 http://pinterest.com/

In order to boost the performance of web image search and overcome the semantic gap between text information and image content, image search re-ranking, which adjusts the initial ranking orders by mining visual content or leveraging some auxiliary knowledge, has been the focus of attention in recent years [19]. There are two primary approaches in this direction: visual pattern mining [10][34] and multi-modality fusion [28][36][38]. The former approach focuses on mining visual recurrent patterns from the initial ranked list based on the following two assumptions: 1) the re-ranked list should not be changed too much from the initial ranked list, and 2) visually similar images should be ranked in close proximity to each other. This category treats different modalities independently and seldom deals with the problem of how to fuse multiple modalities, e.g., color, texture and shape. In contrast, multi-modality fusion aims to learn the modality weights in a linear or non-linear way so as to combine the modalities for better re-ranking performance. It is known that the effects of modalities are query-dependent. For instance, for some queries like "heart" and "sun," the color feature may be more useful, while for some queries such as "buildings," the texture feature will be more effective. Learning appropriate modality weights for each query, nevertheless, is not a trivial problem, and it often remains overlooked in existing research.

Although the above two directions of image search re-ranking have made great progress, challenges remain in determining whether an image is relevant to the search query. To solve this problem, many researchers attempt to use a relevance feedback mechanism [22][41], which requests that users provide relevance scores for images, so that the scores can be treated as an evaluation standard to determine the relevance of the corresponding images. However, it is not easy to obtain sufficient and explicit user feedback, as users are often reluctant to provide enough feedback to search engines.

It should be noted that search engines can record queries issued by users and the corresponding clicked images. Although the clicked images, along with their corresponding queries, cannot reflect the explicit user preference on the relevance of particular query-image pairs, they statistically indicate the implicit relationship between individual images in the ranked list and the given query. Therefore, we can regard click-through data as "implicit" user feedback, based on the assumption that most clicked images are relevant to the given query. In general, as the footprints of user search behavior, click-through data is not only useful for providing implicit relevance feedback from users but is also readily available and freely accessible by search engines. As proven in [37], click-through data can be used to improve the performance of image search re-ranking using an individual modality. The remaining issue is whether click-through data can be leveraged in multi-modality fusion to enhance the accuracy of modality weight prediction.


Fig. 1. Framework comparison between pseudo-relevance feedback and click-based relevance feedback (better viewed in color): (a) framework overview of PRF; (b) framework overview of CBRF.

To address this issue, we propose a novel re-ranking algorithm, called Click-Based Relevance Feedback (CBRF), which not only leverages click-through data to help determine the user intention, but also integrates a multiple kernel learning algorithm to learn the query-dependent fusion weights of multiple modalities. Compared with Pseudo-Relevance Feedback (PRF) [34], which feeds back estimated relevance based on the initial ranked list, the proposed CBRF leverages the implicit relevance from click-through data and refines the initial retrieval results using the classification output of a multiple kernel learning machine. More specifically, PRF uses all query images as positive data and treats the bottom ranked images of the initial ranked list as negative data. In most image searches, however, query images are not usually available, resulting in top ranked images being chosen as positive data in PRF [33]. In comparison with top ranked images, clicked images are more representative and reliable as relevant images, since the decision made by users on whether to click an image is mostly made after browsing the image thumbnails, and is thus more likely dependent on the relevance of the image. Furthermore, the bottom ranked images used in PRF as negative data are not necessarily irrelevant to the given query. Thus, CBRF leverages clicked images as positive data and randomly selects images from other queries as negative data. In order to explore how consistent (contradictory) modalities could incorporate (compromise) with each other, CBRF builds on the simpleMKL [21] framework. This machine learning framework integrates Multiple Kernel Learning (MKL) into a Support Vector Machine (SVM) to automatically learn the modality weights. Unlike PRF, which simply and linearly combines all posterior probabilities from different modalities at the decision level, CBRF explores how to fuse different modalities at the feature level and uses simpleMKL to determine the weights of different modalities, where each modality corresponds to a specific kernel or multiple kernels.

The framework overviews of PRF and CBRF are illustrated in Figure 1(a) and Figure 1(b), respectively. As Figure 1(a) shows, PRF treats the top (bottom) ranked images from the initial ranked list as pseudo-positive (pseudo-negative) data. With these training data, generally, PRF leverages an individual modality to perform re-ranking, which means the number of SVMs is equal to 1, i.e., n = 1. When dealing with multiple

modalities, PRF uses multiple SVMs correspondingly, and then simply fuses the outputs of the different SVMs in a linear way as the final re-ranking scores. In contrast, as Figure 1(b) shows, CBRF treats clicked images as pseudo-positive data and randomly selects images from other queries as pseudo-negative data. For the multiple modalities of images, each modality corresponds to a specific kernel according to its data distribution. The re-ranking scores can then be obtained based on query-dependent multi-modality fusion weights learnt adaptively by simpleMKL.

We apply the proposed CBRF to perform image search re-ranking and conduct experiments over 60 image queries collected from a commercial image search engine. Experimental results show that the proposed CBRF increases the NDCG@10 of the initial search results by 11.62% and outperforms several existing methods. The contributions of this paper can be summarized as follows:

• We use click-through data and leverage multiple kernel learning simultaneously to boost image search performance. To the best of our knowledge, this is the first attempt to leverage click-through data for multi-modality fusion in image search re-ranking.

• We propose an effective novel image search re-ranking method, called click-based relevance feedback, which transforms image re-ranking into a classification problem. It leverages the clicked images as positive data and images from other queries as negative data to improve classification accuracy, and can automatically learn the fusion weight of each modality for different queries at the feature level.

• The algorithm is demonstrated on a real-world large-scale dataset consisting of 115,792,480 image URLs collected from a commercial search engine.

The remaining sections are organized as follows. Section II introduces related work. Section III details the proposed algorithm. Section IV analyzes click-through data collected from a large-scale query log. Section V presents the experimental results of the image search re-ranking. Section VI draws conclusions on our proposed method.

II. RELATED WORK

Our work is related to several research topics, and we group related work into three categories: visual search re-ranking with pattern mining, visual search re-ranking with multi-modality fusion, and click-through data mining.




A. Visual Search Re-ranking With Pattern Mining

According to how the external knowledge is exploited, the research on visual search re-ranking with pattern mining has four paradigms: self-re-ranking [2][9][10][15][33], example-based re-ranking [17][34], crowd-re-ranking [18][39] and interactive re-ranking [8][27][32]. Self-re-ranking focuses on detecting recurrent patterns in the initial search results without any external knowledge, and then uses the recurrent patterns to perform re-ranking. Hsu et al. formulate re-ranking as a random walk problem along the context graph, where video stories are represented as nodes and the edges between them are weighted by multimodal contextual similarities [10]. To train a re-ranking classifier, Yan et al. bring forward the conventional idea of pseudo-relevance feedback (PRF), in which the top-ranked (bottom-ranked) documents are chosen as pseudo-positives (pseudo-negatives) [33]. Compared with self-re-ranking, example-based re-ranking mainly relies on query examples provided by users. For example, Liu et al. [17] leverage query examples to identify an optimal set of document pairs via pair-wise mutual information, which are directly used to recover the final re-ranked list. In contrast to self-re-ranking and example-based re-ranking, crowd-re-ranking is characterized by mining relevant patterns from crowd-sourced knowledge available on the Web, e.g., multiple initial ranked results from various search engines [18] and suggested queries augmented from image collections on the Web [39]. Unlike the above three paradigms, interactive re-ranking involves user interaction, such as human labeling and feedback, to guide the re-ranking process. For instance, Yamamoto et al. require users to edit a part of the search results based on whether they are relevant or not, and propagate these operations to all the results to re-rank them [32]. Hauptmann et al. propose employing an active learning technique to exploit both human bandwidth and machine capabilities for video search re-ranking [8]. In order to reduce users' labeling effort, Tian et al. propose a sample selection strategy based on images' structural information, and then use a discriminative dimension reduction algorithm to capture user intent in the visual feature space [27]. Based on rounds of human evaluation, interactive re-ranking can meet user search intent better than the other methods. However, it should be noted that users are often reluctant to provide enough feedback. Through mining query logs, we leverage click-through data (the number of clicks aggregated over users and sessions in the query log), instead of human labeling or feedback, to boost image search performance. In addition, none of the mentioned re-ranking methods explores the interactions between different modalities (features) for re-ranking. Linearly combining the re-ranking results of textual features and visual features [34] cannot reflect the influence of each feature on the related query. To address this issue, we investigate image search re-ranking with multiple modalities using simpleMKL [21], which can adaptively learn and predict query-dependent fusion weights for multiple modalities.

B. Visual Search Re-ranking With Multi-modality Fusion

When dealing with multiple modalities, one of the earliest considerations is what strategy we can use to fuse the different modalities. The first thought is usually to concatenate various features into a long feature vector and use this joint modality to perform a specific task [25]. This approach is usually called "early fusion." Alternatively, we can apply multiple low-dimensional features to the learning algorithm separately and fuse the results [29][34], which is called "late fusion." Early fusion and late fusion are advantageous compared with re-ranking approaches using an individual modality. However, early fusion suffers from the "curse of dimensionality," and late fusion cannot determine the proper fusion weights for different modalities.

In order to learn modality weights, different kinds of fusion methods have been leveraged by multimedia researchers, which can be categorized into rule-based fusion [24], query-class-dependent fusion [16][35] and adaptive fusion [13][26][28]. Rule-based fusion heuristically assigns the modality weights based on the type of query. For instance, based on a user study, Snoek et al. let users designate the weights of different modalities manually for text-, concept- and visual-oriented queries [24]. This method, though easy to implement, may degrade the retrieval performance when the wrong weight is assigned by users. With more specific and pre-defined query classes, query-class-dependent fusion learns the weights of each class. In [35], Yan et al. classify each user query into one of four predefined categories, and the retrieval results are aggregated with the help of query-class-associated weights. Using a clustering algorithm, such as K-means or hierarchical clustering, Kennedy et al. [16] propose a framework for the automatic discovery of query classes and the corresponding designation of multimodal weights. Generally, this scheme is effective when the query classes can be clearly defined and each class has enough training examples to learn the modality weights at a query-class level. In most cases, nevertheless, a user query cannot be classified into a specific class accurately, since it may belong to several different classes. Given this, adaptive fusion aims to automatically learn modality weights at the individual query level. Jhuo and Lee leverage a multiple kernel learning algorithm to learn feature fusion weights, and then use them to obtain an effective visual similarity [13]. In [28], Wang et al. integrate multiple graphs into a regularization and optimization framework to explore the complementary nature of multiple modalities.

In contrast to visual search re-ranking with pattern mining, the existing literature on multi-modality fusion focuses on how to leverage various modalities adequately. Nevertheless, user feedback, which is an explicit indication of relevance, is ignored. Our work integrates click-through data, which can be viewed as implicit user feedback, into a supervised learning framework for multi-modality fusion. We will show that, with the aid of click-through data, our proposed approach can effectively boost the performance of image search re-ranking.


C. Click-through Data Mining

Click-through data have been widely used in information retrieval [3][5][6][7][14][31]. To improve web search performance, Xue et al. propose an iterative algorithm leveraging user click-through data, which fully explores the interrelations between queries and web pages and effectively finds the virtual associated queries for web pages [31]. In [14], Joachims et al. use eye tracking to analyze the relationship between click-through data and the relevance of query web pages in web search, and prove that click-through data can be used to obtain relative relevance judgments. Carterette and Jones mine useful information from the millions of clicks received by web search engines to predict document relevance [3]. Since click-through data are informative but biased, in order to predict unbiased click data for web search ranking, Dupret et al. propose a model based on users' browsing behavior, which can estimate the probability that a document is seen and thereby provides an unbiased estimate of document relevance [7]. They further use document relevance as a feature for a "learning to rank" machine learning algorithm [6]. With a similar purpose, Chapelle et al. build a Dynamic Bayesian Network (DBN) to provide an unbiased estimation of relevance from click logs [5]. In image search, users browse image thumbnails before selecting the images to click, so the decision to click is likely dependent on the relevance of an image. Thus, intuitively, click-through data can serve as reliable feedback that is potentially useful for search re-ranking.

In spite of the successful use of click-through data in information retrieval, few works integrate them into multimedia retrieval. Dealing with long-tail queries, Jain et al. employ Gaussian Process regression to predict the normalized click count for each image, and combine it with the original ranking score for re-ranking [11]. In [37], Yang et al. leverage click-through data and detect recurrent visual patterns simultaneously to boost the performance of image re-ranking. Their click-boosting random walk (CBRW) re-ranking first ranks images according to their click numbers, and then formulates re-ranking as a random walk problem on an image graph. The effectiveness of CBRW has been proven, but it cannot predict modality weights. In this paper, we focus on using click-through data to perform multi-modality fusion for image search re-ranking. We will show that our proposed method can improve the performance of image search re-ranking significantly by exploiting the influence of click-through data and a proper fusion of multiple modalities.

III. CLICK-BASED RELEVANCE FEEDBACK

Inspired by the retrieval approach Pseudo-Relevance Feedback (PRF) [33][34], we propose a novel re-ranking algorithm, called Click-Based Relevance Feedback (CBRF), which not only sufficiently leverages click-through data but also adequately exploits the interactions between multiple modalities. We begin this section with an overview of our proposed CBRF, then elaborate the multi-modality fusion with simpleMKL [21] embedded in CBRF, and finally give the algorithm details of CBRF.

A. Overview

Similar to pseudo-relevance feedback, the proposed click-based relevance feedback treats image search re-ranking as a binary classification problem, where the positive data consist of the relevant images in the collection and the negative data of the irrelevant ones. The main idea of CBRF is to leverage the click-through data of the related query as pseudo-positive data and apply a multiple kernel learning (MKL) algorithm to learn suitable query-dependent fusion weights for multiple modalities. For a given query, as Figure 1(b) demonstrates, when collecting training data, CBRF first chooses the clicked images (denoted by red circles in Figure 1(b)) as pseudo-positive data, and then randomly selects images from other queries in the dataset (denoted by blue circles in Figure 1(b)) as pseudo-negative data. With these training data, multiple visual features extracted offline are fed into a simpleMKL classifier, which aims to sufficiently explore the influence of the different modalities for the given query. As we can see from Figure 1(b), before multi-modality fusion, each modality is assigned a specific kernel according to its data distribution. Then, through the multiple kernel learning process, the simpleMKL classifier predicts the fusion weights of the multiple modalities adaptively and query-dependently. With an appropriate combination of multiple modalities, the classifier outputs the re-ranking scores of the images from the initial ranked list, which are equal to the posterior probabilities of the images being classified as positive. From the viewpoint of machine learning, this whole re-ranking process can be considered partially supervised learning.

B. SimpleMKL Ensembles

SimpleMKL [21] is the MKL algorithm used in click-based relevance feedback re-ranking, since it has been proven to converge rapidly and to be highly efficient compared with other MKL algorithms. It iteratively determines the combination of modality weights by solving a standard SVM optimization problem based on a gradient descent method. In particular, simpleMKL formulates the MKL problem with a weighted $\ell_2$-norm regularization, while controlling the linear combination of kernels by an $\ell_1$-norm constraint on the kernel weights.

1) Formulation: In simpleMKL, we use a so-called kernel $K(x, x')$, which defines the similarity between two examples $x$ and $x'$, to represent the data of each modality. Because the data distribution of each modality may differ, we should assign a kernel to each modality accordingly. Suppose we have $n$ training data $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ represents each image and $y_i \in \{+1, -1\}$ is the training label of the corresponding image. There are $M$ kinds of modalities for each image $x_i$. For the $m$-th modality, in this paper, we formulate the corresponding kernel matrix based on the radial basis function (RBF):

$$K_m(x_i, x_j) = \exp\left(-\gamma_m \| x_i^m - x_j^m \|^2\right), \tag{1}$$

where $m = 1, 2, \ldots, M$, $x_i^m$ corresponds to the $m$-th feature of image $x_i$, and $\gamma_m$ is a positive parameter which can be computed according to the data distribution.
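As a quick illustration of equation (1), one kernel matrix per modality can be computed as follows. This is a sketch only; feats_m is an assumed (n, d_m) feature array for the m-th modality, and the dense pairwise computation is fine at the scale used here (n around a few hundred images).

```python
import numpy as np

def modality_kernel(feats_m, gamma_m):
    """RBF kernel matrix of equation (1) for one modality."""
    # Pairwise squared Euclidean distances ||x_i^m - x_j^m||^2.
    sq_dists = ((feats_m[:, None, :] - feats_m[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma_m * sq_dists)
```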


Consequently, with $M$ various parameters, $M$ different kernel matrices can be generated. To combine multiple modalities, naturally, a conventional approach is to combine the basic kernels in a convex way:

$$K(x_i, x_j) = \sum_{m=1}^{M} w_m K_m(x_i^m, x_j^m), \tag{2}$$

where $w_m \geq 0$ and $\sum_{m=1}^{M} w_m = 1$. Based on the methodology of SVM, simpleMKL can be addressed by solving the following convex problem:

$$\begin{aligned}
\min_{\{f_m\},b,\xi,w} \quad & \frac{1}{2}\sum_{m=1}^{M}\frac{1}{w_m}\|f_m\|_{\mathcal{H}_m}^{2} + C\sum_{i=1}^{n}\xi_i \\
\text{s.t.} \quad & y_i\Big(\sum_{m=1}^{M} f_m(x_i) + b\Big) \geq 1 - \xi_i \quad \forall i \\
& \xi_i \geq 0 \quad \forall i \\
& \sum_{m=1}^{M} w_m = 1, \quad w_m \geq 0 \quad \forall m.
\end{aligned} \tag{3}$$

Specifically, in equation (3), we look for a decision function $f(x) + b = \sum_{m=1}^{M} f_m(x) + b$, where each function $f_m$ corresponds to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_m$ of kernel $K_m(x_i^m, x_j^m)$, $\xi_i$ is the slack variable, $C$ is the regularization parameter that trades off between the decision function and the slack variables, and $w$ is the weight vector consisting of $\{w_m\}_{m=1}^{M}$. It is worth noting that we use the form $1/w_m$, where $w_m$ can be zero; in the formulation of simpleMKL, we specify that $\|f_m\|_{\mathcal{H}_m}$ has to be equal to zero to yield a finite objective value when $w_m = 0$.

2) Solutions: To find the decision function $f(x) + b$ and learn the modality weights $w$, we first rewrite equation (3) as

$$\min_{w} J(w) \quad \text{s.t.} \quad \sum_{m=1}^{M} w_m = 1, \quad w_m \geq 0, \tag{4}$$

where

$$J(w) = \left\{
\begin{aligned}
\min_{\{f_m\},b,\xi}\ & \frac{1}{2}\sum_{m=1}^{M}\frac{1}{w_m}\|f_m\|_{\mathcal{H}_m}^{2} + C\sum_{i=1}^{n}\xi_i \\
\text{s.t.}\ & y_i\Big(\sum_{m=1}^{M} f_m(x_i) + b\Big) \geq 1-\xi_i \quad \forall i \\
& \xi_i \geq 0 \quad \forall i,
\end{aligned}
\right. \tag{5}$$

and then use a simple gradient method to solve this formulation. By calculating the derivatives of the Lagrangian of equation (5) and setting all derivatives with respect to the primal variables to zero, we get the dual problem of equation (5):

$$\begin{aligned}
\max_{\alpha} \quad & -\frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \sum_{m=1}^{M} w_m K_m(x_i, x_j) + \sum_{i=1}^{n} \alpha_i \\
\text{with} \quad & \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C \quad \forall i,
\end{aligned} \tag{6}$$

where $\alpha$ is the vector of Lagrange multipliers. It is clear that the dual problem of equation (5) becomes a standard SVM dual problem by substituting the combined kernel $K(x_i, x_j) = \sum_{m=1}^{M} w_m K_m(x_i^m, x_j^m)$. Thus, we can use any SVM algorithm to obtain the value of $J(w)$. Assume that each kernel matrix $K_m(x_i, x_j)$ is positive definite, with all eigenvalues greater than some positive value $\eta$. This property implies that the dual problem (6) is strictly concave with convexity parameter $\eta$ for any eligible value of $w$; that is, for each $w$ the dual problem (6) admits a unique solution $\alpha^*$, and the uniqueness of $\alpha^*$ ensures the differentiability of $J(w)$ [4]. Thus, based on (6), we can compute the derivatives of $J(w)$ with respect to $w_m$:

$$\frac{\partial J}{\partial w_m} = -\frac{1}{2}\sum_{i,j=1}^{n} \alpha_i^* \alpha_j^* y_i y_j K_m(x_i, x_j) \quad \forall m. \tag{7}$$
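Given the dual solution, the gradient in equation (7) reduces to one quadratic form per modality. A minimal sketch, where alpha_star and y are restricted to the support vectors and kernels holds the per-modality kernel matrices over those same points:

```python
import numpy as np

def grad_J(alpha_star, y, kernels):
    """Equation (7): dJ/dw_m = -0.5 * (alpha*y)^T K_m (alpha*y) per modality."""
    ay = alpha_star * y                      # elementwise alpha_i* y_i
    return np.array([-0.5 * ay @ Km @ ay for Km in kernels])
```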

To ensure that the equality constraint and the non-negativity constraint on $w$ are satisfied, $w$ is updated along a descent direction once the gradient of $J(w)$ has been computed, and a reduced gradient method is used to handle the equality constraint. The scheme for updating $w$ is $w \leftarrow w + \gamma W$, where $\gamma$ is the step size, which can be determined by line search, and $W$ is the descent direction calculated as

$$W_m = \begin{cases}
0 & \text{if } w_m = 0 \text{ and } \dfrac{\partial J}{\partial w_m} - \dfrac{\partial J}{\partial w_\mu} > 0 \\[6pt]
-\dfrac{\partial J}{\partial w_m} + \dfrac{\partial J}{\partial w_\mu} & \text{if } w_m > 0 \text{ and } m \neq \mu \\[6pt]
\displaystyle\sum_{\nu \neq \mu,\, w_\nu > 0} \Big(\dfrac{\partial J}{\partial w_\nu} - \dfrac{\partial J}{\partial w_\mu}\Big) & \text{for } m = \mu,
\end{cases} \tag{8}$$

where $\mu$ is the index of the largest component of vector $w$.
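Equation (8) translates directly into a small helper; the sketch below computes the direction W from the gradient and current weights (a line search along W would follow, as described above). Names are illustrative.

```python
import numpy as np

def descent_direction(grad, w):
    """Equation (8): reduced-gradient descent direction of simpleMKL."""
    mu = int(np.argmax(w))                        # index of the largest weight
    W = grad[mu] - grad                           # case: w_m > 0 and m != mu
    W[(w == 0) & (grad - grad[mu] > 0)] = 0.0     # keep zero weights at zero
    mask = (w > 0)
    mask[mu] = False
    W[mu] = np.sum(grad[mask] - grad[mu])         # case: m == mu
    return W
```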

To sum up, the simpleMKL algorithm terminates when a stopping criterion is met, such as the duality gap, the KKT conditions, the variation of $w$ between two consecutive steps, or simply a maximal number of iterations. The algorithm details of simpleMKL are given in [21].

C. Algorithm Details

The overall procedure of our proposed re-ranking algorithm, called click-based relevance feedback (CBRF), is summarized in Algorithm 1. We use the lowercase $q$ to represent the textual query issued by users and the uppercase $Q$ to represent the image set returned from the search engine corresponding to query $q$. For a given query $q$, suppose we have an image set $X$ with $N$ images to be re-ranked from the search engine, where $X = \{x_1, x_2, \ldots, x_N\}$ and $x_i$ ($i \in \{1, 2, \ldots, N\}$) denotes the $i$-th image of the initial ranked list. Moreover, we use $X_{pos}$ and $X_{neg}$ to represent the set of positive samples and the set of negative samples, and $N_{pos}$ and $N_{neg}$ to denote the number of images in $X_{pos}$ and $X_{neg}$, respectively. Because search engines can record the click number of each image, we have a click-through data set $C$ of the image set $X$, where $C = \{c_1, c_2, \ldots, c_N\}$ and $c_i$ ($i \in \{1, 2, \ldots, N\}$) denotes the click number of image $x_i$. To be more specific, we give the following explanations about our proposed re-ranking algorithm in Algorithm 1.


Algorithm 1 Click-based Relevance Feedback Algorithm
Input: For a given query q, X_pos contains the clicked images, i.e., X_pos = {x_i | x_i ∈ Q ∧ c_i > 0, i = 1, 2, ..., N}; if N_pos < 10 then add some top ranked images from the initial ranked list to X_pos; X_neg contains images from other queries, i.e., X_neg = {x_i | x_i ∈ Q′ ∧ Q′ ≠ Q}.
1: Assign a basic kernel K_m to each modality m
2: Set kernel weight w_m = 1/M for m = 1, 2, ..., M
3: while stopping criterion not met do
4:   Minimize J(w) using an SVM solver with K = Σ_m w_m K_m
5:   Update w by a gradient descent method
6: end while
7: Compute the posterior probability p_i that image x_i (i = 1, 2, ..., N) is classified as positive, using the SVM solver
8: Reorder the images X = {x_i | x_i ∈ Q, i = 1, 2, ..., N} according to p_i in descending order
Output: Re-ranked list for query q


• The basic idea of our proposed click-based relevance feedback is to boost image retrieval performance by leveraging click-through data, using all the clicked images as positive data and randomly selecting images from other queries as negative data. This is based on the assumption that clicked images are mostly relevant to the given query. As image retrieval can be viewed as a classification problem, with sufficient training data the classifier will learn an appropriate decision function to discriminate between relevant and irrelevant images. However, insufficient positive data will result in an inaccurate classifier. Thus, to overcome this situation, if the number of positive data $N_{pos}$ is less than 10, or if the given query belongs to the tail queries (whose definition will be given in Section V), we select some top ranked images from the initial ranked list as supplemental positive data. In our experiments, we choose the top $(20 - N_{pos})$ images as the auxiliary positive data for tail queries. Conversely, we randomly select images from queries other than the given one as negative data to construct the negative sample set $X_{neg}$, which contains $N_{neg}$ images.

• In step 1 of Algorithm 1, a basic kernel should be chosen for each modality, such as an RBF kernel, a Laplacian kernel or a polynomial kernel. In our implementation, we choose the RBF (Gaussian) kernel shown in equation (1) as the basic kernel for each modality. Based on the different distributions of the training data for each modality, each RBF parameter $\gamma_m$ is calculated according to equation (9):

$$\gamma_m = \frac{N_{pos} \cdot N_{neg}}{\sum_{x_i \in X_{pos},\, x_j \in X_{neg}} D(x_i, x_j)}, \tag{9}$$


where $i = 1, 2, \ldots, N_{pos}$, $j = 1, 2, \ldots, N_{neg}$ and $D(x_i, x_j)$ denotes the cosine distance between image $x_i$ from the positive data set $X_{pos}$ and image $x_j$ from the negative data set $X_{neg}$. Each modality thus has its own particular RBF kernel.

• Steps 3 to 6 in Algorithm 1 contain the key process of simpleMKL introduced in Section III-B. Fixing the kernel weight vector $w$, the basic classifier coefficients can be estimated by minimizing $J(w)$; this minimization can be easily implemented with any efficient SVM solver. Then, the optimal kernel weight $w$ can be obtained by a reduced gradient method with $J(w)$ fixed. The iteration terminates when the stopping criterion is met.

• Since image re-ranking is not exactly the same as binary classification, which can use the sign function to determine an object's category, the re-ranking criterion is based on the posterior probability $p_i$ of each image under the multiple kernel learning classifier [20]. Note that $p_i$ is the posterior probability that image $x_i$ is classified as positive, i.e., $p_i = p(X_{pos}|x_i)$.
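Putting the pieces together, the following compact sketch implements Algorithm 1 with scikit-learn's SVC on a precomputed kernel. It is a minimal illustration under stated assumptions: per-modality feature matrices are already extracted, and the weight update simplifies simpleMKL's reduced-gradient direction with line search (equation (8)) to a fixed gradient step clipped and renormalized onto the simplex. All names (cbrf_rerank, modal_feats) are ours, not from the paper's implementation.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, rbf_kernel
from sklearn.svm import SVC

def cbrf_rerank(modal_feats, pos_idx, neg_idx, C=1.0, n_iter=20, step=0.1):
    """modal_feats: list of (N, d_m) arrays, one per modality; pos_idx/neg_idx:
    indices of pseudo-positive (clicked) and pseudo-negative images."""
    M = len(modal_feats)
    train = np.concatenate([pos_idx, neg_idx])
    y = np.concatenate([np.ones(len(pos_idx)), -np.ones(len(neg_idx))])

    # Step 1: one RBF kernel per modality, with gamma_m from equation (9).
    kernels = []
    for X in modal_feats:
        D = cosine_distances(X[pos_idx], X[neg_idx])
        gamma_m = len(pos_idx) * len(neg_idx) / D.sum()
        kernels.append(rbf_kernel(X, gamma=gamma_m))
    kernels = np.stack(kernels)                      # shape (M, N, N)

    w = np.full(M, 1.0 / M)                          # step 2: uniform weights
    for _ in range(n_iter):                          # steps 3-6
        K = np.tensordot(w, kernels, axes=1)         # combined kernel, eq. (2)
        svm = SVC(C=C, kernel="precomputed", probability=True)
        svm.fit(K[np.ix_(train, train)], y)
        sv = train[svm.support_]                     # support vectors (global ids)
        ay = svm.dual_coef_.ravel()                  # alpha_i* y_i
        # Gradient of J(w), equation (7).
        grad = np.array([-0.5 * ay @ Km[np.ix_(sv, sv)] @ ay for Km in kernels])
        # Simplified update (small step) instead of eq. (8) plus line search.
        w = np.maximum(w - step * grad, 0.0)
        w /= w.sum()

    # Steps 7-8: score each image by its posterior probability of being
    # positive, then sort in descending order.
    pos_col = int(np.where(svm.classes_ == 1)[0][0])
    scores = svm.predict_proba(K[:, train])[:, pos_col]
    return np.argsort(-scores), w
```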

IV. CLICK-THROUGH DATA ANALYSIS

We have collected query logs from a commercial image search engine dated Nov. 2012. The query logs are represented as plain text files that contain a line for each HTTP request satisfied by the Web server. For each record, the fields Query, ClickedURL, ClickCount and Thumbnail are used in our data collection, where ClickedURL and ClickCount represent the URL and the number of clicks on this URL when users submit the Query, respectively, and Thumbnail denotes the corresponding image information on the ClickedURL.

For analyzing the click-through bipartite graph, we used all the queries in the log with at least one click. There are 34,783,188 queries and 115,792,480 image URLs on the bipartite graph. Figure 2 shows the main characteristics of the query and URL distributions. The left plot of Figure 2 shows the query click distribution: each point represents the number of queries (y axis) with a given number of clicks (x axis). The right plot shows the clicked image URL distribution: each point denotes the number of URLs with a given number of clicks. We can see that these two distributions clearly follow power laws; the red lines in Figure 2 denote the fitted power-law curves. The observation is similar to [1], which also states that user search behavior follows a power law. According to the statistics, each query has on average 11.59 clicked URLs, and each URL was clicked on average 3.48 times.
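The aggregate statistics above can be computed with a single pass over the parsed log. A minimal sketch, assuming the log has been parsed into (query, clicked_url, click_count) tuples; the names are illustrative:

```python
from collections import Counter, defaultdict

def click_stats(records):
    """records: iterable of (query, clicked_url, click_count) tuples."""
    urls_per_query = defaultdict(set)
    clicks_per_url = Counter()
    for query, clicked_url, click_count in records:
        urls_per_query[query].add(clicked_url)
        clicks_per_url[clicked_url] += click_count
    avg_urls = sum(len(u) for u in urls_per_query.values()) / len(urls_per_query)
    avg_clicks = sum(clicks_per_url.values()) / len(clicks_per_url)
    return avg_urls, avg_clicks  # e.g., 11.59 and 3.48 on our log
```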


Fig. 2. Query and clicked image URL distribution for the click-through data: (a) query distribution; (b) clicked image URL distribution. Red lines denote the fitted power-law curves.

V. EXPERIMENTS

In this section, we introduce our experimental settings and present the results. We not only give the overall performance that validates the effectiveness of our proposed re-ranking approach, click-based relevance feedback, through comparisons with several re-ranking approaches, but also verify the flexibility of our approach on various kinds of queries.

A. Experimental Settings

We begin by grouping image queries into three categories according to the number of clicked images for the related query. We use #click to denote the number of clicked images. For a given query, if #click ≤ 10, then this query belongs to the tail query category, which has a relatively small amount of click data; if 10 < #click < 60, we classify it as a middle query; and if #click ≥ 60, this query is considered a top query, of the kind most often issued by users to search engines. To facilitate evaluation and compare our proposed method with other methods on the three types of queries, we randomly select 60 queries, shown in Figure 3, consisting of 20 top queries³, 20 middle queries⁴ and 20 tail queries⁵, from the queries mentioned in Section IV as test samples. The 2nd to 4th columns of Table I give the minimum, maximum and average number of clicked images for tail queries, middle queries and top queries, respectively. We manually classify these 60 queries into seven semantic categories: "animal," "concept," "event," "object," "scenery," "people," and "cartoon." The number of queries for each semantic category can be seen in the 5th to 11th columns of Table I.

3 The top queries include: (1) baby shower, (2) backsplash ideas, (3) christmas stockings, (4) deadmau5, (5) free thanksgiving pictures, (6) funny dogs, (7) gray wolf, (8) guitar factory, (9) hurricane sandy aftermath pictures, (10) hurricane sandy jersey shore, (11) ledge stone hearth, (12) love tumblr quotes, (13) mermaids, (14) michael jackson house, (15) monster high pictures, (16) northumberlandia, (17) skull, (18) vintage christmas prints, (19) water fountains, (20) wedding dresses.
4 The middle queries include: (1) 15 party dresses, (2) 3d wallpaper, (3) back tattoos, (4) before and after lsd, (5) bouncy castles, (6) cute baby, (7) fall decorating ideas, (8) funny spongebob pictures, (9) gingerbread man, (10) girl teen bedrooms, (11) graffiti drawings, (12) greta garbo, (13) gymnastics pictures, (14) hawaii, (15) kristen stewart, (16) lady gaga, (17) madonna gangnam style, (18) murano glass, (19) ocean animals, (20) tattoo drawing phoenix birds.
5 The tail queries include: (1) 1 direction names, (2) alice in wonderland disneyland, (3) antonio stradivari, (4) beautiful succulent garden, (5) billy beer, (6) catching fire movie pics, (7) cozumel scuba diving, (8) cute offices, (9) farberware coffee robot, (10) laos history, (11) man with the golden gun, (12) model train snow, (13) ohio state backgrounds, (14) pittsburgh at christmas time, (15) proletariat, (16) she is still in diapers, (17) vee jay cement, (18) vote for me posters, (19) white light, (20) winchester home.
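The categorization rule above is mechanical; a trivial sketch of the thresholds:

```python
def query_category(n_click):
    """Tail/middle/top split from Section V-A, based on #click."""
    if n_click <= 10:
        return "tail"
    if n_click < 60:
        return "middle"
    return "top"
```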

Since the images after the top 500 results are typically irrelevant, we use the top 500 images from the initial search results to perform re-ranking, which means that the number of query images N = 500. For the query images, we extract six global features based on the common choice in content-based image retrieval [23][28], i.e., M = 6, including:

1) 225-dimensional block-wise color moments. Each image is divided into 5-by-5 blocks, and 9-dimensional color moment features are extracted from each block.
2) 64-dimensional HSV color histogram. The 64-dimensional feature vector is extracted from 64-bin HSV color histograms.
3) 144-dimensional color autocorrelogram. HSV color moments are quantized into 36 bins with four different pixel-pair distances.
4) 128-dimensional wavelet texture. The 128-dimensional features are extracted by computing the mean and standard deviation of the energy distribution of each image sub-band at different levels.
5) 75-dimensional edge distribution histogram. Each image is split into 5 blocks, and 15-dimensional EDH features are extracted from each block.
6) 7-dimensional face features. The number of faces, the ratio of face areas and the position of the largest face region are contained in the 7-dimensional face features.

Readers can refer to the survey [23] for more details about visual features. In our experiments, each query-image pair is carefully labeled by annotators on a scale of 0 to 2: 0 for "irrelevant," 1 for "fair" and 2 for "relevant." Before labeling, we first let annotators figure out the meaning of the issued query and check some related web documents to determine the user intention. With the user intention understood as well as possible, the relevance scores of the images are more convincing. We adopt Normalized Discounted Cumulative Gain (NDCG) [12] to measure performance, which is widely used in information retrieval when there are more than two relevance levels. Given a ranked list, the NDCG score at depth $d$ is defined as

$$\mathrm{NDCG}@d = Z_d \sum_{j=1}^{d} \frac{2^{r_j} - 1}{\log(1 + j)}, \tag{10}$$

where $r_j$ is the relevance score of the $j$-th image, and $Z_d$ is a normalization constant that guarantees a perfect ranking's NDCG@d is equal to 1.
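For reference, a minimal implementation of equation (10), computing Z_d from the ideal ordering of the graded labels (function name is ours):

```python
import numpy as np

def ndcg_at_d(relevance, d):
    """NDCG@d per equation (10); `relevance` holds graded scores (0/1/2)
    in ranked order."""
    def dcg(rel):
        rel = np.asarray(rel, dtype=float)[:d]
        return ((2.0 ** rel - 1.0) / np.log1p(np.arange(1, len(rel) + 1))).sum()
    ideal = dcg(sorted(relevance, reverse=True))   # Z_d = 1 / ideal DCG
    return dcg(relevance) / ideal if ideal > 0 else 0.0
```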


TABLE I
THE NUMBER DISTRIBUTION OF CLICKED IMAGES AND THE NUMBER OF QUERIES FOR EACH SEMANTIC CATEGORY IN OUR DATASET.

              min   max    avg   animal  concept  event  object  scenery  people  cartoon
tail query      1    10   6.15        1        1      2      12        3       1        0
middle query   18    57  38.65        1        1      1       8        2       5        2
top query      62   133  82.90        0        1      3       8        4       3        1

Fig. 3. Exemplary relevant image thumbnails for the 60 queries in our dataset: (a) top queries; (b) middle queries; (c) tail queries.

B. Comparison of Different Re-ranking Approaches

We compare our proposed re-ranking algorithm with several re-ranking approaches, where the parameters are optimized to achieve the best possible performance. Note that in Section V-B and Section V-C, in order to differentiate our proposed re-ranking algorithm, click-based relevance feedback, which leverages click-through data and multiple kernel learning simultaneously, from the other methods, we abbreviate our re-ranking approach as CBMKL. For CBMKL, we set the regularization parameter C in equation (3) to 1, and use the duality gap, set to 0.01, as the stopping criterion. The re-ranking approaches compared with CBMKL are introduced as follows:

• Baseline, i.e., the original search results without re-ranking.

• Click-boosting (CB). CB performs re-ranking by only leveraging click-through data; in other words, CB re-ranks images according to their click numbers in descending order.

• Click-boosting random walk (CBRW) [37]. A two-step re-ranking method that combines using click-through data and detecting visual recurrent patterns for image search re-ranking.

• Pseudo-relevance feedback with simpleMKL (PRFMKL). Compared with CBMKL, which leverages clicked images as positive data, PRFMKL treats top ranked images from the initial ranked list as positive data. In our implementation, the top 20 ranked images from the initial search results are used as positive samples in PRFMKL.

• Early fusion based on click-based relevance feedback (EarlyCBRF). Given a query, we simply concatenate the six global features into a long vector. Based on the basic idea of click-based relevance feedback introduced in Section III-C, we then learn a support vector machine (SVM) classifier, and use the classification results to re-rank images.

• Late fusion based on click-based relevance feedback (LateCBRF). Based on the basic idea of click-based relevance feedback, we leverage six SVM classifiers, each of which uses the data of one of the six global features. We linearly fuse the re-ranking results of the six classifiers, in which the fusion weights are tuned for maximum performance.

• Average weighted click-based relevance feedback (ACBMKL). In this approach, without multiple kernel learning, we fix the modality weight w_m = 1/M for each given query, where M = 6 in our case.
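To make the EarlyCBRF and LateCBRF baselines concrete, here is a brief sketch under the same training-data setup, using scikit-learn's SVC; the function names and the posterior-score fusion details are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def early_fusion_scores(modal_feats, train_idx, y):
    """EarlyCBRF: concatenate all modality features, train one SVM."""
    X = np.hstack(modal_feats)
    clf = SVC(probability=True).fit(X[train_idx], y)
    return clf.predict_proba(X)[:, 1]          # posterior of the positive class

def late_fusion_scores(modal_feats, train_idx, y, weights):
    """LateCBRF: one SVM per modality, linearly fuse the posterior scores."""
    per_modality = []
    for X in modal_feats:
        clf = SVC(probability=True).fit(X[train_idx], y)
        per_modality.append(clf.predict_proba(X)[:, 1])
    return np.average(per_modality, axis=0, weights=weights)
```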

Figure 4 shows the overall performance of the different re-ranking approaches on our dataset. On the whole, our proposed click-based relevance feedback (CBMKL) outperforms the other methods, and the improvements are consistent and stable at different depths of NDCG. Using our re-ranking approach, the NDCG@10 is improved dramatically by 11.62%, i.e., from the baseline of 0.8114 to 0.9057, on all 60 queries. We can see that click-boosting re-ranking (CB) performs better than the baseline at all NDCG levels, which indicates that click-through data can provide helpful user feedback information for image re-ranking. Click-boosting random walk (CBRW) improves the re-ranking performance further due to the simultaneous influences of click-through data and visual recurrent patterns of images. Compared with CBRW, there are obvious improvements using PRFMKL, EarlyCBRF, LateCBRF, ACBMKL and CBMKL from NDCG@30 to NDCG@100. This demonstrates that using multiple modalities of images can be more useful than using only one. Although PRFMKL achieves solid performance, it is evident that, even leveraging multiple modalities, PRFMKL performs worse than CBRW in terms of NDCG@10.


Fig. 4. Comparison of re-ranking approaches in terms of NDCG.

The main reason is that the top ranked images from the initial ranked list may not be absolutely relevant to the given query. The positive data used in PRFMKL are images with high initial rankings, but these images are all retrieved by text-based search technology and are unable to reflect user intention explicitly or implicitly. Compared with PRFMKL, the other methods based on the idea of click-based relevance feedback (EarlyCBRF, LateCBRF, ACBMKL and CBMKL), i.e., leveraging clicked images as positive data, all obtain better re-ranking performance, which validates that most clicked images can be viewed as relevant to the given query and can reflect user intention implicitly. Moreover, although EarlyCBRF, LateCBRF and ACBMKL all treat clicked images as positive data in the training phase, they cannot achieve sufficient performance, since they cannot adaptively modulate the weights of multiple modalities for different queries. Our proposed re-ranking algorithm, click-based relevance feedback with simpleMKL (CBMKL), is able to learn proper fusion weights of the different modalities. Thus, we find that the re-ranking performance of CBMKL benefits from exploring how consistent (contradictory) modalities are incorporated (compromised) query-dependently.

In order to further verify the effectiveness of our proposed re-ranking approach, we normalize the NDCG values with respect to the maximum value and the minimum value. Figure 5 displays the NDCG performance at depth 10 across all 60 queries. As we can see from this figure, 37 out of 60 queries obtain the best performance using our proposed click-based relevance feedback (CBMKL). Furthermore, 16 queries achieve a perfect NDCG value at depth 10, such as the queries "antonio stradivari," "hawaii," and "hurricane sandy jersey shore." Figure 8 shows the top 15 images of the different re-ranking approaches for the two queries "hurricane sandy jersey shore" and "laos history." The most satisfactory results are obtained using our proposed method.

C. Evaluation of Different Queries

After the overall evaluation on all 60 queries in our dataset, we evaluate the proposed re-ranking approaches at the query category level. We group the 60 queries in our dataset in two ways. First, according to the number of clicked images for a given query, we categorize queries into top queries, middle queries and tail queries (more details can be found in Section V-A). Then, based on semantic meanings, we group queries into seven semantic categories: "animal," "concept," "event," "object," "scenery," "people," and "cartoon."

Fig. 6. Improvement in NDCG@20 for tail queries, middle queries and top queries. The number of queries in each category is shown in the bracket next to the bin label.

Next, we elaborate the performance of the different categories in these two respects.

Figure 6 illustrates the improvements over the baseline in NDCG@20 of tail queries, middle queries and top queries using the different re-ranking approaches. Our proposed approach achieves the highest improvement in all three categories, with improvements of 12.01%, 11.70% and 11.82% for tail queries, middle queries and top queries, respectively. For tail queries, the performance of click-boosting (CB) is worse than the baseline. As CB is particularly useful for queries where there is a high correlation between clicks and relevance, this situation is understandable: the number of clicked images for tail queries is extremely small, and some images may be clicked out of user preference or interest rather than image relevance. To overcome this problem, we make a special provision for tail queries in the first step of our proposed re-ranking approach: in the selection of positive data for the simpleMKL classifier, if the query belongs to the category of tail queries, then besides the clicked images, the top ranked images (the top (20 − #click) images) are sampled as positive data as well. Something similar happens with top queries under pseudo-relevance feedback with simpleMKL (PRFMKL), whose performance is worse than the baseline by 3%. This is primarily because the top ranked images from the initial search results are mostly unable to reflect the user intention precisely, making it risky to choose top ranked images as positive data without understanding user intention.


Fig. 5. Normalized NDCG@10 of different methods across 60 queries. Note that NDCG is scaled with max-min normalization.

Fig. 7. Improvement in NDCG@20 for different semantic query categories. The number of queries in each semantic category is shown in the bracket next to its name.

Figure 7 shows the improvement in NDCG@20 with queries grouped based on semantics. We use seven categories: animal (gray wolf, ocean animals), concept (1 direction names, backsplash ideas), event (before and after lsd, baby shower), object (white light, farberware coffee robot), scenery (beautiful succulent garden, hawaii), people (kristen stewart, lady gaga), and cartoon (funny spongebob pictures, gingerbread man). Note that some typical queries belonging to each semantic category are shown in the bracket next to its name. As shown in Figure 7, our proposed re-ranking approach achieves improvements in NDCG@20 for all of these semantic categories. Take the semantic category "object" for example, which has the largest number of queries in our dataset: the improvement of our proposed click-based relevance feedback (CBMKL) is very obvious compared with the other re-ranking approaches. There are two reasons behind this. On the one hand, a high correlation between clicks and relevance is easily discernible for this kind of retrieved image, so the clicked images are more likely to be relevant to a given query that refers to visually coherent objects. On the other hand, appropriate fusion weights of the different modalities are learnt via simpleMKL. Without proper modality fusion weights, each modality cannot play a useful role for different queries; for instance, assigning the average weight to the different modalities regardless of the query, i.e., using ACBMKL to perform re-ranking, makes it hard to achieve ideal re-ranking performance for query categories such as "animal," "event," and "people."

VI. CONCLUSIONS

We present an image search re-ranking algorithm, called click-based relevance feedback, by exploring the use of click-through data and the fusion of multiple modalities. Particularly,

we leverage clicked images as pseudo-positive data and randomly select images from other queries as pseudo-negative data. After assigning a specific kernel to each modality, the multiple modalities of images are loaded into the simpleMKL ensembles. Based on a gradient method, a proper combination of modality weights is learnt adaptively and query-dependently. Experiments conducted on a real-world dataset not only demonstrate the usefulness of click-through data, which can be viewed as the footprints of user behavior, in understanding user intention, but also verify the importance of query-dependent fusion weights for multiple modalities. Moreover, significant performance improvement using our proposed re-ranking approach is observed for most query types in our dataset compared with the other re-ranking approaches, which validates the effectiveness and superiority of our approach.

In this paper, we only take image search relevance into consideration, though image diversity is another important factor in search performance. In future work, we will focus on enhancing the diversity of re-ranked images by duplicate detection or other such methods.

VII. ACKNOWLEDGMENTS

This work was supported by the National High Technology Research and Development Program of China (2014AA015202), the National Key Technology Research and Development Program of China (2012BAH39B02), the National Natural Science Foundation of China (61173054, 61172153), and the Beijing New Star Project on Science & Technology (2007B071).



Fig. 8. Re-ranked lists of specific queries from different methods: (a) “hurricane sandy jersey shore” and (b) “laos history” [best viewed in color]. Each panel shows the top-ranked images (positions 1–15) returned by Baseline, CBs, CBRW, PRFMKL, EarlyCBRF, LateCBRF, ACBMKL, and CBMKL. Red boxes mark images irrelevant to the query, and blue boxes mark the fair ones.

VII. ACKNOWLEDGMENTS

This work was supported by the National High Technology Research and Development Program of China (2014AA015202), the National Key Technology Research and Development Program of China (2012BAH39B02), the National Natural Science Foundation of China (61173054, 61172153), and the Beijing New Star Project on Science & Technology (2007B071).

REFERENCES

[1] R. Baeza-Yates and A. Tiberi, “Extracting semantic relations from query logs,” ACM SIGKDD, pp. 76–85, 2007.
[2] N. Ben-Haim, B. Babenko, and S. Belongie, “Improving web-based image search via content based clustering,” IEEE CVPRW, pp. 106–121, 2006.
[3] B. Carterette and R. Jones, “Evaluating search engines by modeling the relationship between relevance and clicks,” NIPS, 2007.
[4] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, “Choosing multiple parameters for SVM,” Machine Learning, vol. 46, pp. 131–159, 2002.
[5] O. Chapelle and Y. Zhang, “A dynamic Bayesian network click model for web search ranking,” ACM WWW, pp. 1–10, 2009.
[6] G. Dupret and C. Liao, “A model to estimate intrinsic document relevance from the clickthrough logs of a web search engine,” ACM WSDM, pp. 181–190, 2010.
[7] G. Dupret and B. Piwowarski, “A user browsing model to predict search engine click data from past observations,” ACM SIGIR, pp. 331–338, 2008.

[8] A. G. Hauptmann, W.-H. Lin, R. Yan, J. Yang, and M.-Y. Chen, “Extreme video retrieval: Joint maximization of human and computer performance,” ACM Multimedia, pp. 385–393, 2006.
[9] W. H. Hsu, L. S. Kennedy, and S.-F. Chang, “Video search reranking via information bottleneck principle,” ACM Multimedia, pp. 35–44, 2006.
[10] W. H. Hsu, L. S. Kennedy, and S.-F. Chang, “Video search reranking through random walk over document-level context graph,” ACM Multimedia, pp. 971–980, 2007.
[11] V. Jain and M. Varma, “Learning to rerank: Query-dependent image reranking using click data,” ACM WWW, pp. 277–286, 2011.
[12] K. Järvelin and J. Kekäläinen, “IR evaluation methods for retrieving highly relevant documents,” ACM SIGIR, pp. 41–48, 2000.
[13] I.-H. Jhuo and D. Lee, “Boosting-based multiple kernel learning for image re-ranking,” ACM Multimedia, pp. 1159–1162, 2010.
[14] T. Joachims, L. Granka, and B. Pan, “Accurately interpreting clickthrough data as implicit feedback,” ACM SIGIR, pp. 154–161, 2005.
[15] L. S. Kennedy and S. Chang, “A reranking approach for context-based concept fusion in video indexing and retrieval,” ACM CIVR, pp. 333–340, 2007.
[16] L. S. Kennedy, A. P. Natsev, and S. Chang, “Automatic discovery of query-class-dependent models for multimodal search,” ACM Multimedia, pp. 882–891, 2005.
[17] Y. Liu and T. Mei, “Optimizing visual search reranking via pairwise learning,” IEEE Transactions on Multimedia, vol. 13, pp. 280–291, 2011.
[18] Y. Liu, T. Mei, and X.-S. Hua, “CrowdReranking: Exploring multiple search engines for visual search reranking,” ACM SIGIR, pp. 500–507, 2009.



[19] T. Mei, Y. Rui, S. Li, and Q. Tian, “Multimedia search reranking: A literature survey,” ACM Computing Surveys, vol. 46, no. 38, 2014.
[20] J. C. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” Advances in Large Margin Classifiers, 1999.
[21] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, “SimpleMKL,” Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.
[22] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra, “Relevance feedback: A power tool for interactive content-based image retrieval,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, no. 5, pp. 644–655, 1998.
[23] A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, “Content-based image retrieval at the end of the early years,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349–1380, 2000.
[24] C. Snoek, K. van de Sande, O. de Rooij, et al., “The MediaMill TRECVID 2008 semantic video search engine,” NIST TRECVID Workshop, 2008.
[25] C. G. M. Snoek, M. Worring, and A. W. M. Smeulders, “Early versus late fusion in semantic video analysis,” ACM Multimedia, pp. 399–402, 2005.
[26] H.-K. Tan and C.-W. Ngo, “Fusing heterogeneous modalities for video and image re-ranking,” ACM ICMR, 2011.
[27] X. Tian, D. Tao, X.-S. Hua, and X. Wu, “Active reranking for web image search,” IEEE Trans. on Image Processing, vol. 19, no. 3, pp. 805–820, 2010.
[28] M. Wang, H. Li, D. Tao, K. Lu, and X. Wu, “Multimodal graph-based reranking for web image search,” IEEE Trans. on Image Processing, vol. 21, no. 11, pp. 4649–4661, 2012.
[29] Y. Wu, E. Y. Chang, K. C.-C. Chang, and J. R. Smith, “Optimal multimodal fusion for multimedia data analysis,” ACM Multimedia, pp. 572–579, 2004.
[30] H. Xie, Y. Zhang, J. Tan, L. Guo, and J. Li, “Contextual query expansion for image retrieval,” IEEE Trans. on Multimedia, vol. 16, pp. 1104–1114, 2014.
[31] G.-R. Xue, H.-J. Zeng, Z. Chen, Y. Yu, W.-Y. Ma, W. Xi, and W. Fan, “Optimizing web search using web click-through data,” ACM CIKM, pp. 118–126, 2004.
[32] T. Yamamoto, S. Nakamura, and K. Tanaka, “Rerank-by-example: Efficient browsing of web search results,” Database and Expert Systems Applications, vol. 4653, pp. 801–810, 2007.
[33] R. Yan and A. Hauptmann, “Query expansion using probabilistic local feedback with application to multimedia retrieval,” ACM CIKM, pp. 361–370, 2007.
[34] R. Yan, A. Hauptmann, and R. Jin, “Multimedia search with pseudo-relevance feedback,” ACM CIVR, pp. 238–247, 2003.
[35] R. Yan, J. Yang, and A. G. Hauptmann, “Learning query-class dependent weights in automatic video retrieval,” ACM Multimedia, pp. 548–555, 2004.
[36] X. Yang, Y. Zhang, T. Yao, C.-W. Ngo, and T. Mei, “Click-boosting multi-modality graph-based reranking for image search,” Multimedia Systems, May 2014.
[37] X. Yang, Y. Zhang, T. Yao, Z.-J. Zha, and C.-W. Ngo, “Click-boosting random walk for image search reranking,” ACM ICIMCS, 2013.
[38] T. Yao, C.-W. Ngo, and T. Mei, “Circular reranking for visual search,” IEEE Trans. on Image Processing, vol. 22, no. 4, pp. 1644–1655, 2013.
[39] Z.-J. Zha, L. Yang, T. Mei, M. Wang, and Z. Wang, “Visual query suggestion,” ACM Multimedia, pp. 15–24, 2009.
[40] L. Zhang, Y. Zhang, X. Gu, J. Tang, and Q. Tian, “Scalable similarity search with topology preserving hashing,” IEEE Trans. on Image Processing, vol. 23, pp. 3025–3039, 2014.
[41] X. S. Zhou and T. S. Huang, “Relevance feedback in image retrieval: A comprehensive review,” Multimedia Systems, vol. 8, pp. 536–544, 2003.


Yongdong Zhang (M’08-SM’13) received the Ph.D. degree in electronic engineering from Tianjin University, Tianjin, China, in 2002. He is currently a Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His current research interests include multimedia content analysis and understanding, multimedia content security, video encoding, and streaming media technology. He has authored over 100 refereed journal and conference papers. He was a recipient of the Best Paper Awards at PCM 2013, ICIMCS 2013, and ICME 2010, and a Best Paper Candidate at ICME 2011. He serves as an Editorial Board Member of Multimedia Systems Journal and Neurocomputing.

Xiaopeng Yang received the B.E. degree in computer science and technology and the M.E. degree in computer software and theory, both from Shandong Normal University, Jinan, China, in 2010 and 2012, respectively. She is currently pursuing the Ph.D. degree in computer science at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. She was an intern at Microsoft Research Asia, Beijing, China, from March to May 2013. She was the recipient of the Best Paper Award at ACM ICIMCS 2013. Her research interests include multimedia content analysis, multimedia information retrieval, and social multimedia analysis.

Tao Mei (M’07-SM’11) is a Lead Researcher with Microsoft Research, Beijing, China. He received the B.E. degree in automation and the Ph.D. degree in pattern recognition and intelligent systems from the University of Science and Technology of China, Hefei, China, in 2001 and 2006, respectively. His current research interests include multimedia information retrieval and computer vision. He has authored or co-authored over 150 papers in journals and conferences, eight book chapters, and has edited three books. He holds eight granted U.S. patents with more than 20 pending. Dr. Mei was the recipient of several paper awards from prestigious multimedia conferences, including the Best Paper Awards at ACM Multimedia in 2007 and 2009, the Best Poster Paper Award at the IEEE MMSP in 2008, the Top 10% Paper Award at the IEEE MMSP in 2012, the Best Paper Award at ACM ICIMCS in 2012, the Best Student Paper Award at the IEEE VCIP in 2012, and the IEEE Trans. on Multimedia Prize Paper Award in 2013. He was the principal designer of the automatic video search system that achieved the best performance in the worldwide TRECVID evaluation in 2007. He received the Microsoft Gold Star Award in 2010, and Microsoft Technology Transfer Awards in 2010 and 2012. He is an Associate Editor of Neurocomputing and the Journal of Multimedia, a Guest Editor of the IEEE Trans. on Multimedia, the IEEE Multimedia Magazine, the ACM/Springer Multimedia Systems, and the Journal of Visual Communication and Image Representation. He is the Program Co-Chair of MMM 2013 and the General Co-Chair of ACM ICIMCS 2013. He is a Senior Member of the IEEE and the ACM.

