IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 5, MAY 2014


Click Prediction for Web Image Reranking Using Multimodal Sparse Coding

Jun Yu, Member, IEEE, Yong Rui, Fellow, IEEE, and Dacheng Tao, Senior Member, IEEE

Abstract— Image reranking is effective for improving the performance of a text-based image search. However, existing reranking algorithms are limited for two main reasons: 1) the textual meta-data associated with images is often mismatched with their actual visual content and 2) the extracted visual features do not accurately describe the semantic similarities between images. Recently, user click information has been used in image reranking, because clicks have been shown to more accurately describe the relevance of retrieved images to search queries. However, a critical problem for click-based methods is the lack of click data, since only a small number of web images have actually been clicked on by users. Therefore, we aim to solve this problem by predicting image clicks. We propose a multimodal hypergraph learning-based sparse coding method for image click prediction, and apply the obtained click data to the reranking of images. We adopt a hypergraph to build a group of manifolds, which explore the complementarity of different features through a group of weights. Unlike a graph that has an edge between two vertices, a hyperedge in a hypergraph connects a set of vertices, and helps preserve the local smoothness of the constructed sparse codes. An alternating optimization procedure is then performed, and the weights of different modalities and the sparse codes are simultaneously obtained. Finally, a voting strategy is used to describe the predicted click as a binary event (click or no click), from the images' corresponding sparse codes. Thorough empirical studies on a large-scale database including nearly 330K images demonstrate the effectiveness of our approach for click prediction when compared with several other methods. Additional image reranking experiments on real-world data show that the use of click prediction is beneficial to improving the performance of prominent graph-based image reranking algorithms.

Index Terms— Image reranking, click, manifolds, sparse codes.

I. INTRODUCTION

Due to the tremendous number of images on the web, image search technology has become an active and challenging research topic. Well-recognized image search engines,

Manuscript received February 22, 2013; revised August 31, 2013 and February 20, 2014; accepted March 6, 2014. Date of publication March 11, 2014; date of current version March 31, 2014. This work was supported in part by the National Natural Science Foundation of China under Grant 61100104, in part by ARC under Grants FT130101457 and DP-140102164, in part by the Program for New Century Excellent Talents in University under Grant NCET-12-0323, and in part by the Hong Kong Scholar Programme under Grant XJ2013038. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Mark H.-Y. Liao. J. Yu is with the School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China (e-mail: [email protected]). Y. Rui is with Microsoft Research Asia, Peking, China (e-mail: [email protected]). D. Tao is with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology, Sydney, Ultimo, NSW 2007, Australia (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2014.2311377

such as Bing [1], Yahoo [2] and Google [3], usually use textual meta-data, included in the surrounding text, titles, captions, and URLs, to index web images. Although the performance of text-based image retrieval is acceptable for many searches, the accuracy and efficiency of the retrieved results could still be improved significantly. One major problem impacting performance is the mismatch between the actual content of an image and the textual data on the web page [4]. One method used to solve this problem is image re-ranking, in which both textual and visual information are combined to return improved results to the user. The ranking of images based on a text-based search is considered a reasonable baseline, albeit a noisy one. Extracted visual information is then used to re-rank related images toward the top of the list. Most existing re-ranking methods use a tool known as pseudo-relevance feedback (PRF) [34], where a proportion of the top-ranked images are assumed to be relevant and subsequently used to build a model for re-ranking. This is in contrast to relevance feedback, where users explicitly provide feedback by labeling the top results as positive or negative. In the classification-based PRF method [35], the top-ranked images are regarded as pseudo-positive examples and low-ranked images as pseudo-negative examples to train a classifier, and the list is then re-ranked. Hsu et al. [36] also adopt this pseudo-positive and pseudo-negative image method to develop a clustering-based re-ranking algorithm. The problem with these methods is that the reliability of the obtained pseudo-positive and pseudo-negative images is not guaranteed. PRF has also been used in graph-based re-ranking [37] and Bayesian visual re-ranking [38]. In these methods, low-ranked images are promoted by receiving reinforcement from related high-ranked images. However, these methods are limited by the fact that irrelevant high-ranked images are not demoted.
Therefore, both explicit and implicit re-ranking methods suffer from the unreliability of the original ranking list, since the textual information cannot accurately describe the semantics of the queries. Instead of relying on related textual information, user clicks have recently been used as a more reliable measure of the relationship between the query and retrieved objects [5], [6], since clicks have been shown to more accurately reflect relevance [7]. Joachims et al. [39] conducted an eye-tracking experiment to observe the relationship between clicked links and the relevance of the target pages, while Shokouhi et al. [8] investigated the effect of reordering web search results based on clickthrough data on search effectiveness. In the case of image searching, clicks have proven to be very reliable [7]; 84% of clicked images were relevant, compared

1057-7149 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


to 39% relevance of documents found using a general web search. Based on this fact, Jain et al. [9] proposed a method which utilizes clicks for query-dependent image searching. However, this method only takes clicks into consideration and neglects the visual features which might improve the retrieved image relevance to the query. In another study, Jain and Varma [10] proposed a Gaussian regression model which directly concatenates the clicks and various visual features into a long vector. Unfortunately, the diversity of multiple visual features was not taken into consideration. According to commercial search engine analysis reports, only 15% of web images are clicked by web users. This lack of clicks is a problem that makes effective click-based re-ranking challenging for both theoretical studies and real-world implementation. In order to solve this problem, we adopt sparse coding to predict click information for web images. Sparse coding is a popular signal processing method and performs well in many applications, e.g. signal reconstruction [11], signal decomposition [12], and signal denoising [13]. Although orthogonal bases like Fourier or wavelets have been widely adopted, the latest trend is to adopt an overcomplete basis, in which the number of basis vectors is greater than the dimensionality of the input vector. A signal can be described by a set of overcomplete bases using a very small number of nonzero elements [18]. This yields high sparsity in the transform domain, and many applications benefit from this compact representation of signals. In computer vision, signals are image features, and sparse coding is adopted as an efficient technique for feature reconstruction [14]–[16]. It has been widely used in many different applications, such as image classification [14], face recognition [15], image annotation [17], and image restoration [13]. In this paper, we formulate and solve the problem of click prediction through sparse coding.
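To make the sparse-coding setup concrete, the following sketch recovers a sparse coefficient vector over a synthetic overcomplete basis. It uses scikit-learn's Lasso as the l1 solver rather than the paper's solver, and the dimensions, alpha value, and data are illustrative only:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, s = 64, 256                              # feature dim d, base size s (overcomplete: s > d)
A = rng.standard_normal((d, s))             # columns play the role of basis features
c_true = np.zeros(s)
c_true[[3, 40, 100]] = [1.0, -0.5, 2.0]     # the signal truly uses only 3 bases
x = A @ c_true                              # the "image feature" to reconstruct

# Lasso minimizes (1/(2d)) * ||x - A c||^2 + alpha * ||c||_1, a rescaled l1 objective
c = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000).fit(A, x).coef_
nnz = np.count_nonzero(np.abs(c) > 1e-6)
print(f"{nnz} of {s} coefficients are non-zero")
```

Despite the basis being overcomplete, only a handful of coefficients come back non-zero, which is exactly the compactness the paragraph above describes.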
Based on a group of web images with associated clicks (known as a codebook), and a new image without any clicks, sparse coding is utilized to choose as few basis images as possible from the codebook in order to linearly reconstruct the new input image while minimizing reconstruction error. A voting strategy is utilized to predict the click as a binary event (click or no click) from the sparse codes of the corresponding images. The overcomplete characteristic of the codebook guarantees the sparsity of the reconstruction coefficients. However, in addition to sparsity, the overcompleteness of the codebook causes a loss of locality in the features being represented. This results in similar web images being described by totally different sparse codes, and unstable performance in image reconstruction; clicks are thus not predicted successfully. In order to address this issue, one feasible solution is to add an additional locality-preserving term to the formulation of sparse coding. Laplacian sparse coding (LSC) [19], in which a locality-preserving Laplacian term is added to the sparse code, makes the sparse codes more discriminative while maintaining the similarity of features, and enhances the sparse coding's robustness. However, LSC [19] can only handle single-feature images; in practice, web images are usually described by multiple features. For instance, commercial search engines extract and


Fig. 1. Example images and their click counts for the queries "bull" and "White Tiger".

preserve different features such as color histograms, edge direction histograms, and SIFT descriptors. Two categories of methods are used to deal with multimodal data: early fusion and late fusion [20], [50]. They differ in the way they integrate the results from feature extraction on the various modalities. In early fusion, feature vectors from different modalities are concatenated into a new vector. However, this concatenation is often unsatisfactory because it ignores the specific characteristics of each feature. In late fusion, the results obtained by learning on each modality are integrated, but these fused results may not be satisfactory since the results for each individual modality might be poor, and assigning appropriate weights to different modalities is difficult. In this paper we propose a novel method named multimodal hypergraph learning-based sparse coding for click prediction, and apply the predicted clicks to re-rank web images. Both strategies of early and late fusion of multiple features are used in this method through three main steps. First, we construct a web image base with associated click annotation, collected from a commercial search engine. As shown in Fig. 1, the search engine has recorded clicks for each image. Fig. 1(a), (b), (e), and (f) indicate that the images with high clicks are strongly relevant to the queries, while Fig. 1(c), (d), (g), and (h) present non-relevant images with zero clicks. These two components form the image bases. Second, we consider both early and late fusion in the proposed objective function. The early fusion is realized by directly concatenating multiple visual features, and is applied in the sparse coding term. Late fusion is accomplished in the manifold learning term. For web images without clicks, we implement hypergraph learning [29] to construct a group of manifolds, which preserves local smoothness using hyperedges. Unlike a graph, where an edge connects two vertices, a hyperedge in a hypergraph connects a set of vertices.
Common graph-based learning methods usually only consider the pairwise relationship between two vertices, ignoring the higher-order relationship among three or more vertices. Using this term can help the proposed method preserve the local smoothness of the constructed sparse codes. Finally, an alternating optimization procedure is conducted to explore the complementary nature of different modalities. The weights of different modalities and the sparse codes are simultaneously obtained using this optimization strategy. A voting strategy is then adopted to predict if an input image will be clicked or not, based on its sparse code.
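The voting step described above can be illustrated with a minimal toy example (the function name, inputs, and data below are ours, not the paper's code):

```python
import numpy as np

def predict_click(sparse_code, base_clicks):
    """Majority vote over the base images selected by the sparse code.

    sparse_code : (s,) reconstruction coefficients over the image base
    base_clicks : (s,) binary click labels (1 = clicked) of the base images
    """
    selected = np.flatnonzero(sparse_code)            # base images used in the reconstruction
    if selected.size == 0:
        return 0                                      # nothing selected: predict "no click"
    return int(base_clicks[selected].mean() > 0.5)    # more than 50% clicked -> "click"

# toy example: 3 of the 4 selected base images were clicked
code = np.array([0.0, 0.7, 0.0, -0.2, 0.1, 0.0, 0.3])
clicks = np.array([1, 1, 0, 1, 0, 0, 1])
print(predict_click(code, clicks))  # -> 1
```

The quality of this vote hinges entirely on which base images the sparse code selects, which is why the locality-preserving manifold term matters.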

YU et al.: CLICK PREDICTION FOR WEB IMAGE RERANKING

The obtained click is then integrated within a graph-based learning framework [37] to achieve image re-ranking. In summary, we present the important contributions of this paper:
• First, we effectively utilize search-engine-derived images annotated with clicks, and successfully predict the clicks for new input images without clicks. Based on the obtained clicks, we re-rank the images, a strategy which could be beneficial for improving commercial image searching.
• Second, we propose a novel method named multimodal hypergraph learning-based sparse coding. This method uses both early and late fusion in multimodal learning. By simultaneously learning the sparse codes and the weights of different hypergraphs, the performance of sparse coding improves significantly.
• Third, we conduct comprehensive experiments to empirically analyze the proposed method on real-world web image datasets collected from a commercial search engine, with the corresponding clicks collected from internet users. The experimental results demonstrate the effectiveness of the proposed method.
The rest of this paper is organized as follows. In Section II, we briefly review some related work. The proposed method of multimodal hypergraph learning-based sparse coding is presented in Section III. Section IV presents our experimental results. In Section V we apply clicks to image re-ranking, and finally, in Section VI we draw our conclusions.

II. RELATED WORK

A. Multimodal Learning for Web Images

We can assume that each web image $i$ is described by $t$ visual features $x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(t)}$. A common method for handling multimodal features is to directly concatenate them into a long vector $[x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(t)}]$, but this representation may reduce the performance of algorithms [20], especially when the features are independent or heterogeneous. It is also possible that the structural information of each feature may be lost in feature concatenation [20].
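The two fusion styles discussed above can be contrasted in a toy sketch (all names, dimensions, scores, and weights below are illustrative assumptions, not values from the paper):

```python
import numpy as np

# toy per-modality features for one image (dimensions are illustrative)
x_color = np.array([0.2, 0.5, 0.3])          # e.g. a tiny color histogram
x_texture = np.array([0.7, 0.1])             # e.g. a tiny texture descriptor

# early fusion: concatenate into one long vector (per-view structure is lost)
x_early = np.concatenate([x_color, x_texture])
print(x_early.shape)  # (5,)

# late fusion: score each modality separately, then combine the scores
scores = {"color": 0.8, "texture": 0.4}      # hypothetical per-view relevance scores
weights = {"color": 0.6, "texture": 0.4}     # choosing these weights is the hard part
fused = sum(weights[m] * scores[m] for m in scores)
```

The difficulty of choosing the late-fusion weights by hand is precisely what the paper's alternating optimization over the modality weights is meant to address.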
In [20], the methods of multimodal feature fusion are classified into two categories, namely early fusion and late fusion. It has been shown that if an SVM classifier is used, late fusion tends to result in better performance [20]. Wang et al. [30] have provided a method to integrate graph representations generated from multiple modalities for the purpose of video annotation. Geng et al. [31] have integrated graph representations using a kernelized learning approach. Our work integrates multiple features into a graph-based learning algorithm for click prediction.

B. Graph-Based Learning Methods

Graph-based learning methods have been widely used in the fields of image classification [21], ranking [22] and clustering. In these methods, a graph is built according to the given data, where vertices represent data samples and edges describe their


similarities. The Laplacian matrix [23] is constructed from the graph and used in a regularization scheme: the local geometry of the graph is preserved during the optimization, and the learned function is forced to be smooth on the graph. However, a simple graph-based method cannot capture higher-order information. Unlike a simple graph, a hyperedge in a hypergraph links several (two or more) vertices, and thereby captures this higher-order information. Hypergraph learning has achieved excellent performance in many applications. For instance, Shashua [24] utilized the hypergraph for image matching using convex optimization. Hypergraphs have been applied to solve problems in multilabel learning [25] and video segmentation [26]. Tian et al. [27] have provided a semi-supervised learning method named HyperPrior to classify gene expression data by using biological knowledge as a constraint. In [28], a hypergraph-based image retrieval approach has been proposed. In this paper, we construct the hypergraph Laplacian using the algorithm presented in [29].

III. MULTIMODAL HYPERGRAPH LEARNING-BASED SPARSE CODING FOR CLICK PREDICTION

Here we present definitions for multimodal hypergraph learning-based sparse coding for click prediction, and define the important notations used in the rest of the paper. Capital letters, e.g. X, represent the database of web images. Lower-case letters, e.g. x, represent images, and $x_i$ is the $i$th image of X. A superscript $(i)$, e.g. $X^{(i)}$ and $x^{(i)}$, represents the web image's feature from the $i$th modality. A multimodal image database with $n$ images and $t$ representations can be represented as $X = \{X^{(i)} = [x_1^{(i)}, \ldots, x_n^{(i)}] \in \mathbb{R}^{m_i \times n}\}_{i=1}^{t}$. Fig. 2 illustrates

the details of the proposed framework. First, multiple features are extracted to describe the web images. Second, from these features, we construct multiple hypergraph Laplacians, and perform sparse coding based on the integration of the multiple features. Meanwhile, the local smoothness of the sparse codes is preserved by using manifold learning on the hypergraphs. The sparse codes of the images, and the weights for the different hypergraphs, are obtained by simultaneous optimization using an iterative two-stage procedure. A voting strategy is adopted to predict the click as a binary event (click or no click) from the obtained sparse codes. Specifically, the non-zero positions in a sparse code identify a group of base images, which are used to reconstruct the input image. If more than 50% of these images have clicks, then the input image is predicted as clicked; otherwise, it is predicted as not clicked. Finally, a graph-based schema [37] is conducted with the predicted clicks to achieve image re-ranking. Some important notations are presented in Table I.

Fig. 2. The framework of multimodal hypergraph learning-based sparse coding for click prediction. First, multiple features are extracted from both the input images and image bases. Second, multiple hypergraph Laplacians are constructed, and the sparse codes are built. Meanwhile, the locality of the obtained sparse codes is preserved by using manifold learning on hypergraphs. Then, the sparse codes of the images and the weights for different hypergraphs are obtained by simultaneous optimization through an iterative two-stage procedure. A voting strategy is used to achieve click data propagation. Finally, the obtained sparse codes are integrated with the graph-based schema for image re-ranking.

TABLE I
IMPORTANT NOTATIONS AND THEIR DESCRIPTIONS

A. Definition of Hypergraph-Based Sparse Coding

Given an image x ($x \in \mathbb{R}^d$) and web image bases with associated clicks $A = [a_1, a_2, \ldots, a_s]$, $A \in \mathbb{R}^{d \times s}$, sparse coding builds a linear reconstruction of the given image x using the bases in A: $x = c_1 a_1 + c_2 a_2 + \cdots + c_s a_s = Ac$. The reconstruction coefficient vector c for click prediction is sparse, meaning that only a small proportion of the entries in c are non-zero. Let $\|c\|_0$ denote the number of non-zero entries of the vector c; sparse coding can then be described as $\min \|c\|_0$ s.t. $x = Ac$. However, this minimization is NP-hard. It has been proven in [18] that minimizing the $\ell_1$-norm approximates the sparsest solution, so most studies formulate the sparse coding problem as the minimization of the $\ell_1$-norm of the reconstruction coefficients. The objective of sparse coding can be defined as [32]:

$$\min_{c} \|x - Ac\|^2 + \alpha \|c\|_1. \tag{1}$$

The reconstruction error is represented by the first term in (1), and the second term controls the sparsity of the sparse code c; $\alpha$ is a tuning parameter that balances sparsity against reconstruction error. Under this formulation, the web images are represented independently, and similar web images may be described by totally different sparse codes. One reason for this is the loss of locality information in equation (1). Therefore, to preserve the locality information, the hypergraph Laplacian is incorporated into (1). We use V and E to denote the vertex set and the hyperedge set of a hypergraph G = (V, E). The hyperedge weight vector is w, and each hyperedge $e_i$ is assigned a weight $w(e_i)$. A $|V| \times |E|$ incidence matrix H represents G with the following elements:

$$H(v, e) = \begin{cases} 1, & \text{if } v \in e, \\ 0, & \text{if } v \notin e. \end{cases} \tag{2}$$

Based on H, the vertex degree of each vertex $v \in V$ is

$$d(v) = \sum_{e \in E} w(e)\, H(v, e), \tag{3}$$

and the edge degree of a hyperedge $e \in E$ is

$$\delta(e) = \sum_{v \in V} H(v, e). \tag{4}$$

Dv and De are used to denote diagonal matrices of vertex degrees and hyperedge degrees, respectively. Let W denote the diagonal matrix of the hyperedge weights. The value of each hyperedge’s weight is set according to the rules used in [28]. First, we construct the |V | × |V | affinity matrix 



S, whose entries are given by $S_{ij} = \exp\left(-\|v_i - v_j\|^2 / \sigma^2\right)$, where $\sigma$ is the average distance among all vertices. Then, the weight for each hyperedge is calculated as $W_{ii} = \sum_{v_j \in e_i} S_{ij}$. Here $e_i$ is a vertex set composed of the K nearest neighbors of vertex $v_i$. The unnormalized hypergraph Laplacian matrix [29] can be defined as follows:

$$L = D_v - H W D_e^{-1} H^T. \tag{5}$$
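Eqs. (2)-(5) can be assembled as follows. This is an illustrative sketch, not the authors' code; it builds one kNN hyperedge per vertex and includes the vertex itself in its own hyperedge, which is one common convention:

```python
import numpy as np

def hypergraph_laplacian(V, k=3):
    """Unnormalized hypergraph Laplacian L = Dv - H W De^{-1} H^T (Eqs. (2)-(5)).

    V : (n, d) array of vertex features; hyperedge e_i groups v_i with its k nearest neighbors.
    """
    n = V.shape[0]
    D = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)  # pairwise distances
    sigma = D[np.triu_indices(n, 1)].mean()                    # average pairwise distance
    S = np.exp(-(D ** 2) / sigma ** 2)                         # affinity matrix

    H = np.zeros((n, n))                                       # |V| x |E| incidence, Eq. (2)
    w = np.zeros(n)                                            # hyperedge weights
    for i in range(n):
        members = np.argsort(D[i])[:k + 1]                     # v_i plus its k nearest neighbors
        H[members, i] = 1.0
        w[i] = S[i, members].sum()                             # W_ii: summed affinities in e_i
    Dv = np.diag(H @ w)                                        # vertex degrees, Eq. (3)
    De = np.diag(H.sum(axis=0))                                # hyperedge degrees, Eq. (4)
    return Dv - H @ np.diag(w) @ np.linalg.inv(De) @ H.T       # Eq. (5)
```

Like an ordinary graph Laplacian, the resulting matrix is symmetric and its rows sum to zero, which is what lets the smoothness term penalize differences between the codes of vertices that share a hyperedge.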

Therefore, we want the sparse codes of images within the same hyperedge to be similar to each other. Summing up the pairwise distances between the sparse codes within each hyperedge, weighted by $w(e)/\delta(e)$, the hypergraph-based sparse coding can be formulated as

$$\min_{c_1,\ldots,c_n} \sum_i \|x_i - A c_i\|^2 + \alpha \sum_i \|c_i\|_1 + \frac{\beta}{2} \sum_{e \in E} \sum_{(p,q) \in e} \frac{w(e)}{\delta(e)} \|c_p - c_q\|^2, \tag{6}$$

where $c_i$ is a vector of sparse codes. Hence, the web images connected by the same hyperedge are encoded as similar sparse codes using this formulation, and the similarity among the web images within the same hyperedge is preserved. We denote $X = [x_1, x_2, \ldots, x_n]$ and $C = [c_1, c_2, \ldots, c_n]$. Equation (6) can be rewritten as

$$\min_{c_1,\ldots,c_n} \|X - AC\|_F^2 + \alpha \sum_i \|c_i\|_1 + \beta\, \mathrm{tr}\!\left(C L C^T\right). \tag{7}$$

B. Multimodal Feature Combinations

In real applications, images are described by multimodal features. Given a dataset with multiple features $X = \{X^{(i)} = [x_1^{(i)}, \ldots, x_n^{(i)}] \in \mathbb{R}^{m_i \times n}\}_{i=1}^{t}$, in which each representation $X^{(i)}$ is a feature matrix from view $i$, we can formulate the objective function based on (7) as:

$$\min_{c_1,\ldots,c_n} \sum_i \sum_{j=1}^{t} \|x_i^{(j)} - A^{(j)} c_i\|^2 + \alpha \sum_i \|c_i\|_1 + \beta\, \mathrm{tr}\!\left(C \left(\sum_{j=1}^{t} \lambda_j L^{(j)}\right) C^T\right), \tag{8}$$

where $L^{(j)}$ is the constructed hypergraph Laplacian matrix for the $j$th view, and $\lambda_j$ is the corresponding weight. $A^{(j)} = [a_1^{(j)}, a_2^{(j)}, \ldots, a_s^{(j)}]$, $A^{(j)} \in \mathbb{R}^{d \times s}$, is a specified codebook for the $j$th view. Eq. (8) can be rewritten as

$$\min_{C,\lambda} \sum_{j=1}^{t} \|X^{(j)} - A^{(j)} C\|_F^2 + \alpha \sum_i \|c_i\|_1 + \beta\, \mathrm{tr}\!\left(C \left(\sum_{j=1}^{t} \lambda_j L^{(j)}\right) C^T\right) \quad \text{s.t.} \quad \sum_{j=1}^{t} \lambda_j = 1, \ \lambda_j > 0. \tag{9}$$

The object of Equation (9) is to find a sparse linear reconstruction of the given images in X by using multiple bases from different features. The reconstruction coefficients $c_i$ for each image are sparse, which means that only a small fraction of the entries of $c_i$ are non-zero. We present implementation details of (9) in Section III.C and Section III.D. We propose an alternating optimization procedure with two stages to solve the problem in (9). First, we fix the weights $\lambda$ and optimize C. Then we fix C and optimize $\lambda$. These two stages are iterated until the objective function converges.

C. Implementations for Sparse Codes

Instead of optimizing the entire sparse code matrix C simultaneously, we optimize each $c_k$ sequentially until the whole C converges. To optimize $c_k$, we fix all the remaining sparse codes $c_p$ ($p \neq k$) and the weights $\lambda$. Therefore, we can obtain $\hat{L} = \sum_{j=1}^{t} \lambda_j L^{(j)}$. The optimization of (9) can be rewritten with respect to $c_k$ as follows:

$$\min_{c_k} Q(c_k) + \alpha \|c_k\|_1 \quad \forall k, \tag{10}$$

where

$$Q(c_k) = \sum_{i=1}^{t} \|x_k^{(i)} - A^{(i)} c_k\|^2 + \beta \left( c_k^T (C \hat{L})_k + \left((C \hat{L})_k\right)^T c_k - \hat{L}_{kk}\, c_k^T c_k \right), \tag{11}$$

and $(C\hat{L})_k$ denotes the $k$th column of the matrix $C\hat{L}$. To solve the problem in (11), some derivations are conducted on the first term $\sum_{i=1}^{t} \|x_k^{(i)} - A^{(i)} c_k\|^2$ of (11):

$$\sum_{i=1}^{t} \|x_k^{(i)} - A^{(i)} c_k\|^2 = \sum_{i=1}^{t} \left(x_k^{(i)} - A^{(i)} c_k\right)^T \left(x_k^{(i)} - A^{(i)} c_k\right) = \sum_{i=1}^{t} \left( x_k^{(i)T} x_k^{(i)} - x_k^{(i)T} A^{(i)} c_k - \left(A^{(i)} c_k\right)^T x_k^{(i)} + \left(A^{(i)} c_k\right)^T A^{(i)} c_k \right) = \|x_k - A c_k\|^2, \tag{12}$$

where $x_k = [x_k^{(1)}; \ldots; x_k^{(t)}] \in \mathbb{R}^{(m_1 + m_2 + \cdots + m_t) \times 1}$ and $A = [A^{(1)}; \cdots; A^{(t)}] \in \mathbb{R}^{(m_1 + m_2 + \cdots + m_t) \times N}$. According to the method in [32], we adopt the feature-sign search algorithm to solve $c_k$. Fig. 3 shows the details of this algorithm. We should emphasize that, in order to speed up the convergence of the sparse codes, we initialize them with the results of general sparse coding. After we complete the optimization of $c_k$, C is updated accordingly. In our experiments, C converges in only a few iterations. In Fig. 3, the matrix $C_{-k}$ is the submatrix obtained by removing the $k$th column of matrix C, and the vector $\hat{L}_{k,-k}$ is the subvector formed by removing the $k$th entry of the vector $\hat{L}_k$. $\Upsilon_{c_k}$ and $\Upsilon_{c_k c_k}$ in Fig. 3 are calculated as

$$\Upsilon_{c_k} = \frac{\partial Q(c_k)}{\partial c_k} = 2 \left( A^T A c_k - A^T x_k + \beta C_{-k} \hat{L}_{k,-k} + \beta \hat{L}_{kk} c_k \right), \qquad \Upsilon_{c_k c_k} = \frac{\partial^2 Q(c_k)}{\partial (c_k)^2} = 2 \left( A^T A + \beta \hat{L}_{kk} I \right). \tag{13}$$
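The stacking of Eq. (12) and the gradient/Hessian of Eq. (13) can be sketched as follows (an illustrative implementation with our own variable names; the feature-sign search loop itself is omitted):

```python
import numpy as np

def q_gradient_hessian(A_list, x_list, C, L_hat, k, beta):
    """Gradient and Hessian of Q(c_k) (Eq. (13)) using the stacked form of Eq. (12).

    A_list : per-modality codebooks A^(i), each of shape (m_i, s)
    x_list : per-modality features x_k^(i) of the k-th image, each of shape (m_i,)
    C      : (s, n) current sparse code matrix
    L_hat  : (n, n) combined Laplacian sum_j lambda_j L^(j)
    """
    A = np.vstack(A_list)                 # Eq. (12): stack the modalities vertically
    x = np.concatenate(x_list)
    c_k = C[:, k]
    C_mk = np.delete(C, k, axis=1)        # C without its k-th column
    L_k_mk = np.delete(L_hat[k], k)       # k-th row of L_hat without the (k,k) entry
    grad = 2 * (A.T @ A @ c_k - A.T @ x
                + beta * C_mk @ L_k_mk + beta * L_hat[k, k] * c_k)
    hess = 2 * (A.T @ A + beta * L_hat[k, k] * np.eye(A.shape[1]))
    return grad, hess
```

Because the per-modality reconstruction terms collapse into a single stacked term, the multimodal case reduces to the single-modality feature-sign update on the concatenated codebook.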



Fig. 3. Algorithm details of the implementation for sparse codes.

D. Implementations for Obtaining Weights

In the second stage, we fix C to update $\lambda$. Therefore, in (9), only the term $\mathrm{tr}(C (\sum_{j=1}^{t} \lambda_j L^{(j)}) C^T)$ can affect the objective function's value, and Eq. (9) can be rewritten as

$$\min_{\lambda}\ \mathrm{tr}\!\left(C \left(\sum_{j=1}^{t} \lambda_j L^{(j)}\right) C^T\right) \quad \text{s.t.} \quad \sum_{j=1}^{t} \lambda_j = 1, \ \lambda_j > 0. \tag{14}$$

The solution to $\lambda$ in (14) is $\lambda_p = 1$ for the modality with the minimum $\mathrm{tr}(C L^{(j)} C^T)$, and $\lambda_p = 0$ otherwise. This indicates that only one modality will be selected, so this solution cannot effectively explore the complementary characteristics of different modalities. To solve this problem, the method in [33] is utilized: the $\lambda_j$ in (14) is replaced by $\lambda_j^z$ with $z > 1$. Then $\sum_{j=1}^{t} \lambda_j^z$ reaches its minimal value when $\lambda_j = 1/t$ subject to $\sum_{j=1}^{t} \lambda_j = 1$, $\lambda_j > 0$. In this case, equation (14) can be reformulated as

$$\min_{\lambda}\ \mathrm{tr}\!\left(C \left(\sum_{j=1}^{t} \lambda_j^z L^{(j)}\right) C^T\right) \quad \text{s.t.} \quad \sum_{j=1}^{t} \lambda_j = 1, \ \lambda_j > 0, \tag{15}$$

where $z > 1$. The constraint $\sum_{j=1}^{t} \lambda_j = 1$ is taken into consideration through a Lagrange multiplier, and the objective function in (15) can be rewritten as

$$\Psi(\lambda, \varsigma) = \mathrm{tr}\!\left(C \left(\sum_{j=1}^{t} \lambda_j^z L^{(j)}\right) C^T\right) - \varsigma \left(\sum_{j=1}^{t} \lambda_j - 1\right). \tag{16}$$
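Solving (16) yields the closed-form weight update of Eq. (17), which is straightforward to implement (a sketch; the variable names are ours):

```python
import numpy as np

def update_weights(C, L_list, z=2.0):
    """Closed-form modality weights of Eq. (17), for fixed sparse codes C."""
    # tr(C L^(j) C^T) measures how non-smooth the codes are on the j-th manifold
    costs = np.array([np.trace(C @ L @ C.T) for L in L_list])
    scores = (1.0 / costs) ** (1.0 / (z - 1.0))  # smoother manifolds get larger scores
    return scores / scores.sum()                  # normalize so the weights sum to 1
```

The parameter z > 1 controls how evenly the weight is spread: as z grows the weights approach the uniform 1/t, while as z approaches 1 all weight concentrates on the single smoothest modality, recovering the degenerate solution of (14).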

From $\Psi(\lambda, \varsigma)$, we set the partial derivatives with respect to $\lambda_j$ and $\varsigma$ to zero. Thus, $\lambda_j$ can be obtained as

$$\lambda_j = \frac{\left(1 / \mathrm{tr}(C L^{(j)} C^T)\right)^{1/(z-1)}}{\sum_{j=1}^{t} \left(1 / \mathrm{tr}(C L^{(j)} C^T)\right)^{1/(z-1)}}. \tag{17}$$

The Laplacian matrix $L^{(j)}$ is positive semidefinite, so we have $\lambda_j \geq 0$. When C is fixed, the global optimum of $\lambda_j$ can be obtained from (17). The algorithm details are listed in Fig. 4.

Fig. 4. The details of the alternating algorithm to obtain sparse codes and optimal weights.

E. Time Complexity Analysis

Suppose the experiment is conducted on a dataset with image bases $A = \{A^{(i)} = [a_1^{(i)}, \ldots, a_{N_1}^{(i)}] \in \mathbb{R}^{m_i \times N_1}\}_{i=1}^{t}$ and image set $X = \{X^{(i)} = [x_1^{(i)}, \ldots, x_{N_2}^{(i)}] \in \mathbb{R}^{m_i \times N_2}\}_{i=1}^{t}$. The time complexity of the alternating algorithm that obtains the sparse codes for X comes from two parts:
• The calculation of $L^{(j)}$ for the visual manifolds: the time complexity of this part is $O(\sum_{i=1}^{t} m_i (N_2)^2)$.
• The alternating algorithm that obtains the sparse codes and the optimal weights: for the update of $\lambda$, the time complexity is $O(t \times (N_2)^2)$. For the sparse coding part, we adopt the efficient sparse coding algorithm [32] to calculate the sparse code c for each x, and C converges in only a few iterations. We use $T_1$ to denote the number of iterations of efficient sparse coding, so the time complexity of this part is $O(T_1 \times N_2)$.

Therefore, the entire time complexity of the proposed algorithm is $O\left(\left(\sum_{i=1}^{t} m_i (N_2)^2 + t \times (N_2)^2 + T_1 \times N_2\right) \times T_2\right)$, where $T_2$ is the number of iterations of the alternating optimization. $T_1$ is less than 3 and $T_2$ is less than 5 in all our experiments. According to the demonstrations in [32], the time complexity of efficient sparse coding is lower than that of state-of-the-art sparse coding methods including a QP solver, a modified version of LARS [47], grafting [48], and Chen et al.'s interior-point method [49]. Therefore, it can be guaranteed that our method obtains state-of-the-art time complexity.

IV. EXPERIMENTAL RESULTS AND DISCUSSION

TABLE II
DETAILS OF THE REAL-WORLD WEB QUERIES DATASETS

To demonstrate the effectiveness of the proposed method, we conduct experiments on a real-world dataset with images collected from a commercial search engine. We compare the performance of the proposed method with representative algorithms, such as single hypergraph learning-based sparse coding [19], single graph learning-based sparse coding [19], regular sparse coding [41] and the k-nearest neighbor (k-NN) algorithm. The experiments are conducted in two stages. In the first stage, we compare our method with the others for click prediction. In the second stage, we conduct experiments to test the sensitivity of the parameters. The details are provided below.

A. Dataset Description

We use the real-world Web Queries dataset, which contains 200 diverse, representative queries collected from the query log of a commercial search engine. In total, it contains 330,665 images. Table II provides details of the real-world web query datasets, including the number of queries in each category and some examples. We select this dataset to assess our method for click prediction for two main reasons. First, the web queries and their related images originate directly from the internet, and the queries are mainly 'hot' (i.e. current) queries that have appeared frequently over the past six months. Second, this dataset contains real click data, making it easy to evaluate whether our method accurately predicts clicks on web images. The labels of images in the dataset are assigned according to their click counts: the images are separated into those whose click count is larger than zero and those whose click count is zero. We represent each image by extracting five different visual features: block-wise color moments (CM), the HSV color histogram (HSV), the color autocorrelogram (CO), wavelet texture (WT) and a face feature.

B. Experiment Configuration

To evaluate the performance of the proposed method for click prediction, we compare the following seven methods, including the proposed method:
1. Multimodal hypergraph learning-based sparse coding (MHL). Parameters α and β in (9) are selected by five-fold cross validation. The neighborhood size k in the hyperedge generation process and the value of z in (15) are tuned to their optimal values.



TABLE III
PERFORMANCE COMPARISONS OF CLASSIFICATION ACCURACY (%) FOR CLICK PREDICTION WITH A FIXED SIZE OF IMAGE BASES AND VARIED SIZES OF TEST IMAGE SETS. THE COMPARISON INCLUDES MHL, MGL, SHL, SGL, SC, KNN AND GP. THE SIZE OF THE TEST IMAGE SET IS VARIED AMONG [5%, 10%, 15%, 20%, 25%], AND THE SIZE OF THE IMAGE BASE IS FIXED AT 75%. RESULTS SHOWN IN BOLD ARE SIGNIFICANTLY BETTER THAN THE OTHERS

2. Multimodal graph learning-based sparse coding (MGL). Following the framework of (9), we adopt a simple graph [40] in place of the hypergraph. Parameters α and β in (9) are determined by five-fold cross-validation; the neighborhood size k and the value of z are tuned to their optimal values.

3. Single hypergraph learning-based sparse coding (SHL) [19]. The framework in (7) is applied to each visual feature separately, and the average performance is reported as SHL(A). In addition, we concatenate the visual features into a long vector and conduct SHL on it; these results are denoted SHL(L). All parameters are tuned to their optimal values.

4. Single graph learning-based sparse coding (SGL) [19]. We adopt a simple graph [40] in place of the hypergraph in (7). The performance of SGL(A) and SGL(L) is recorded, with parameters tuned to their optimal values.

5. Regular sparse coding (SC). Sparse coding is conducted directly on each visual feature separately using the Lasso [41]. The average performance is reported as SC(A); sparse coding on the concatenated long vector is recorded as SC(L).

6. The k-nearest neighbor algorithm (KNN). As a baseline, we apply KNN to each visual feature; it classifies a sample by finding its closest samples in the training set. Each parameter is tuned to its optimal value, and KNN(A) and KNN(L) are reported.

7. Gaussian Process regression (GP) [10]. This method identifies a group of clicked images and performs dimensionality reduction on the concatenated visual features. A Gaussian Process regressor is trained on the set of clicked images and then used to predict click counts for all images.

We randomly select images to form the image bases and test images.
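The per-query random selection of image bases and test images can be sketched as follows (a minimal illustration; the function name and the `base_frac`/`test_frac` parameters are ours, corresponding to the percentage settings of the protocol):

```python
import random

def split_query_images(image_ids, base_frac=0.75, test_frac=0.05, seed=0):
    """Randomly split one query's images into a disjoint image base
    (the dictionary for sparse coding) and a test set, by percentage."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    n_base = int(len(ids) * base_frac)
    n_test = int(len(ids) * test_frac)
    return ids[:n_base], ids[n_base:n_base + n_test]

base, test = split_query_images(range(100))
print(len(base), len(test))  # 75 5
```

Because the split is done per query, the base and test proportions stay comparable across queries of very different sizes.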
Since different queries contain different numbers of images, it would be inappropriate to fix the same absolute number for every query; we therefore choose different percentages of images to form the image bases. Specifically, the experiments are conducted in two settings: in the first, the size of the test image set is fixed at 5% and the size of the image base is varied among [10%, 30%, 50%, 70%, 90%]; in the second, the size of the image base is fixed at 75% and the size of the test image set is varied among [5%, 10%, 15%, 20%, 25%]. We also conduct experiments to show the effects of the different parameters. For all methods, we independently repeat the experiments five times with randomly selected image bases and report the averaged results.

C. Results on Click Prediction

The performance of MHL was compared with that of the other methods. We ran MHL, MGL, SHL, SGL, and SC to obtain sparse codes for the input images, and used the voting strategy to predict whether each image would be clicked. Classification accuracy (%) is used to measure click prediction performance. Table III lists the average classification accuracy of the different methods. We used 75% of the images from each query to form the image base, and conducted the experiments under five conditions in which the proportion of test images varied from 5% to 25%. The results show that nearly all of the methods improve on the baseline. Our method, MHL, achieved the best click prediction results, and the hypergraph-based methods performed better than the graph-based ones: the high-order information preserved by the hypergraph construction helps maintain local smoothness, so compared with a normal graph, the hypergraph effectively improves click prediction performance. In addition, the multimodal methods (MHL and MGL) outperformed the single-modality methods (SHL and SGL), which suggests that the multimodal design in (9) and its alternating optimization algorithm are effective in obtaining optimal classification results. Another interesting finding is that the graph-based learning methods (MHL, MGL, SHL, and SGL) perform better than regular sparse coding (SC).
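The hyperedge construction behind this local-smoothness property (each sample joined with its k nearest neighbors) and the resulting normalized hypergraph Laplacian of Zhou et al. [29] can be sketched as follows; unit hyperedge weights and Euclidean distances are simplifying assumptions of ours:

```python
import numpy as np

def knn_hypergraph_laplacian(X, k=10):
    """Build one hyperedge per sample (the sample plus its k nearest
    neighbors, Euclidean) and return the normalized hypergraph Laplacian
    of Zhou et al. [29]: Delta = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    H = np.zeros((n, n))                   # vertex-by-hyperedge incidence matrix
    for e in range(n):
        H[np.argsort(d2[e])[:k + 1], e] = 1.0   # sample e plus its k neighbors
    w = np.ones(n)                         # unit hyperedge weights (assumption)
    De = H.sum(axis=0)                     # hyperedge degrees
    Dv = H @ w                             # vertex degrees
    Dv_isqrt = np.diag(1.0 / np.sqrt(Dv))
    Theta = Dv_isqrt @ H @ np.diag(w / De) @ H.T @ Dv_isqrt
    return np.eye(n) - Theta
```

The resulting Laplacian is symmetric positive semidefinite, so adding it as a regularizer penalizes sparse codes that vary within a hyperedge, which is the smoothness effect discussed above.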
The graph- and hypergraph-based regularizers are effective in producing good sparse codes. Additionally, MHL is stable as the size of the test image set increases from 5% to 25%. Table IV compares the performance of the different methods when the size of the test image set is fixed at 5% and the size of the image base is varied from 10% to 90%. In general, MHL had the best performance. The classification accuracy of MHL increases from 64.8% to 66.9% as the size of the image base increases from 10% to 90%, indicating that a small image base can adversely affect click prediction performance.
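The voting step that turns a test image's sparse code into a click/no-click decision is not fully specified in this section; one plausible reading, sketched here under that assumption, is a weighted vote in which each base image contributes the magnitude of its coefficient to its own class (function and variable names are ours):

```python
import numpy as np

def predict_click_by_voting(sparse_code, base_labels):
    """Weighted vote over the image base: each base image contributes the
    absolute value of its sparse-code coefficient to its own class
    (1 = clicked, 0 = not clicked); the larger total wins."""
    w = np.abs(np.asarray(sparse_code, dtype=float))
    labels = np.asarray(base_labels)
    return int(w[labels == 1].sum() > w[labels == 0].sum())

code = [0.8, 0.0, 0.3, 0.1, 0.0]   # sparse code of one test image over 5 base images
labels = [1, 0, 1, 0, 0]           # click labels of those base images
print(predict_click_by_voting(code, labels))  # 1
```

Under this reading, a test image is predicted as clicked exactly when the clicked base images dominate its sparse reconstruction.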


TABLE IV. Performance comparisons of classification accuracy (%) for click prediction with varied sizes of image bases and a fixed size of test image sets. The comparison includes MHL, MGL, SHL, SGL, SC, KNN, and GP. The size of the image base is varied among [10%, 30%, 50%, 70%, 90%], and the size of the test image set is fixed at 5%. Results shown in bold are significantly better than the others.

Fig. 5. Average classification accuracy (%) for different values of the parameters α, β, K, and z. The sizes of the image bases and test image sets are fixed at 10% and 5%, respectively. (a) Classification accuracy versus α. (b) Classification accuracy versus β. (c) Classification accuracy versus K. (d) Classification accuracy versus z.

Another interesting finding in Tables III and IV is that regular sparse coding does not perform better than KNN. The reason is that the overcompleteness of sparse coding loses the locality of the features: similar web images can be described by totally different sparse codes, which makes click prediction unstable. To address this issue, an additional locality-preserving term with a Laplacian matrix is added to the sparse coding formulation; accordingly, the results in Tables III and IV show that SHL and SGL perform better than KNN.

D. The Effect of Changing Parameter Values

In Fig. 5, we show the sensitivity of the parameters α, β, K, and z in the graph-based sparse coding algorithms. In these experiments, the image base comprises 10% of the images and the test set 5%. We first fixed β to β_opt and varied α over {10^-2 α_opt, 10^-1 α_opt, α_opt, 10^1 α_opt, 10^2 α_opt}; the average classification accuracies are shown in Fig. 5(a). We then fixed α to α_opt and varied β over {10^-2 β_opt, 10^-1 β_opt, β_opt, 10^1 β_opt, 10^2 β_opt}; the average classification accuracies are shown in Fig. 5(b). From these figures we see that MHL performs best. From 10^-2 α_opt to 10^1 α_opt the methods perform stably, as shown in Fig. 5(a); however, from 10^1 α_opt to 10^2 α_opt the performance degrades severely. In Fig. 5(b), the methods are stable as β increases from 10^-2 β_opt to 10^2 β_opt. In Fig. 5(c), we observe that the hypergraph-based methods


Fig. 6. Performance comparisons of classification accuracy (%) among MHL, SHL(BOF), and SHL(LLC). The size of the image bases is fixed at 75% and the size of the test image set is varied among [5%, 10%, 15%, 20%, 25%].

(MHL, SHL(A), and SHL(L)) obtain their highest performance when K is fixed at 10, while the graph-based methods (MGL, SGL(A), and SGL(L)) perform best when K is set to 5. In Fig. 5(d), we varied the parameter z from 2 to 6 and observed that the classification accuracies of MHL and MGL are highest when z is 5.

E. Discussion About Features

The experiments in Section IV-C adopt conventional features. Recently, some state-of-the-art features have been proposed, such as the bag of features (BOF) built on SIFT descriptors [45] and locality-constrained linear coding (LLC) [46], an SPM-like descriptor that has obtained state-of-the-art performance in image classification. Here, a 1024-dimensional BOF feature and a 21504-dimensional LLC feature with spatial block structure [1 2 4] are extracted for each image. The experimental results, comparing MHL, SHL(BOF), and SHL(LLC), are presented in Fig. 6; the size of the image bases is fixed at 75% and the size of the test image set is varied among [5%, 10%, 15%, 20%, 25%]. The results demonstrate that our proposed method MHL performs better than SHL on either BOF or LLC, indicating that the multimodal learning adopted in MHL is effective in enhancing classification performance.

F. Experimental Results on Scene Recognition

In this part, we further validate MHL by conducting image classification experiments on the standard Scene 15 dataset [43], which contains 1500 images belonging to 15 natural scene categories: bedroom, CALsuburb, industrial, kitchen, livingroom, MITcoast, MITforest, MIThighway, MITinsidecity, MITmountain, MITopencountry, MITstreet, MITtallbuilding, PARoffice, and store. Five different features are adopted to describe the scenes: the color histogram (CH), edge direction histogram (EDH), SIFT [45], Gist [44], and locality-constrained linear coding (LLC) [46].
In our experiments, the labeled sample images in the bases are randomly selected. Since the size of each class varies significantly, it is inappropriate to find a fixed number for

Fig. 7. Performance comparisons of classification accuracy (%) for scene recognition among MHL, MGL, SHL, SGL, SC, and KNN. (a) The size of the image bases is fixed at 75% and the size of the test image set is varied among [5%, 10%, 15%, 20%, 25%]. (b) The size of the test image set is fixed at 5% and the size of the image base is varied among [10%, 30%, 50%, 70%, 90%].

Fig. 8. Comparisons of the average NDCG measurements. The click-based method, which combines the multimodal visual features with click information, outperforms the other methods.

different classes. Therefore, we fix a percentage, which is used to choose images from each class to form the image bases and test images. For the scene recognition experiments, we compare the performance of six methods: multimodal hypergraph learning-based sparse coding (MHL), multimodal graph learning-based sparse coding (MGL), single hypergraph learning-based sparse coding (SHL), single graph learning-based sparse coding (SGL), regular sparse coding (SC), and the k-nearest neighbor algorithm (KNN); the details of these methods are given in Section IV-B. As shown in Fig. 7(a), our proposed method MHL outperforms the other


Fig. 9. The top 10 images in the commercial ranking list (baseline) and the re-ranking lists obtained using the graph-based method and the click-based method for the queries "Butterfly" and "Music". Green, blue, and red indicate relevance scales 2, 1, and 0, respectively. The click-based method clearly obtains the best results. (a) Image re-ranking results for the query "Butterfly". (b) Image re-ranking results for the query "Music".

methods in all cases. This demonstrates that the proposed method also performs well on a standard dataset.

V. APPLICATION OF THE ALGORITHM FOR IMAGE RERANKING

To evaluate the efficacy of the proposed method for image re-ranking, we conduct experiments on a new dataset consisting of two subsets, A and B. Subset A comprises the 330,665 images with 200 queries used in Section IV, and subset B contains 94,925 images for the same 200 queries. Images in subset A have associated click data and are used to form the image base for sparse coding. The images in subset B carry the original ranking information from a popular search engine, so we can easily evaluate whether our approach improves on the search engine's ranking. In subset B, each image was labeled by a human oracle according to its relevance to the corresponding query as "not relevant", "relevant", or "highly relevant". We utilize

scores of 0, 1, and 2 to indicate the three relevance levels, respectively.

We use the popular graph-based re-ranking scheme [38] in the design of our experiment. In this framework, a ranking score list y = [s_1, s_2, ..., s_n]^T is a vector of ranking scores corresponding to an image set X = {x_1, x_2, ..., x_n}. The purpose of graph-based re-ranking is to learn a new ranking score list from the images' visual features, so the re-ranking process can be formulated as a function f: s = f(X, s^*), where s^* = [s_1^*, s_2^*, ..., s_n^*]^T is the initial ranking score list. Graph-based re-ranking can then be cast as the regularization framework

    arg min_s { Ω(s) + γ R_emp(s) },    (18)

in which Ω(s) is the regularization term that keeps the ranking scores of visually similar images close, R_emp(s) is an empirical loss, and γ is a tradeoff parameter to balance the


empirical loss and the regularizer. According to [38], (18) can be rewritten as

    arg min_s { s^T L s + γ ||s − s^*||^2 },    (19)

in which L is a Laplacian matrix. Since we can calculate a group of weights λ_j for each modality, the Laplacian in (19) is assigned as L = Σ_{j=1}^{t} λ_j L^(j).
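Setting the gradient of (19) to zero gives a closed-form solution, (L + γI)s = γs^*, i.e., s = γ(L + γI)^{-1} s^*; a minimal sketch (assuming L is the combined Laplacian above):

```python
import numpy as np

def rerank(L, s0, gamma=1.0):
    """Closed-form minimizer of s^T L s + gamma * ||s - s0||^2:
    the gradient 2 L s + 2 gamma (s - s0) vanishes when
    (L + gamma I) s = gamma * s0."""
    n = len(s0)
    return np.linalg.solve(L + gamma * np.eye(n), gamma * np.asarray(s0, float))

# toy example: path-graph Laplacian over 4 images, with initial scores s0
L = np.array([[ 1, -1,  0,  0],
              [-1,  2, -1,  0],
              [ 0, -1,  2, -1],
              [ 0,  0, -1,  1]], dtype=float)
s = rerank(L, [1.0, 0.0, 1.0, 0.0], gamma=2.0)
```

Because L + γI is positive definite for γ > 0, the solve is always well posed, and the result smooths the initial scores across visually similar (graph-connected) images.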

Based on the graph-based framework, we compare two methods that create the initial scores differently. In the first method, the initial score s^* is obtained with the traditional heuristic [38], which derives s_i^* from the position τ_i as s_i^* = 1 − τ_i / n, where n is the number of images for the query. We call this the "graph-based method". In the second method, the predicted click information is considered along with the position, so we call it the "click-based method". For a given image i, if it is predicted to be clicked by the user, the initial score is s_i^* = (1 − τ_i / n + 1) / 2; if it is not predicted to be clicked, the score is s_i^* = (1 − τ_i / n) / 2. The ranking results of the commercial search engine are provided as the baseline, which we name the "commercial baseline".

To evaluate the re-ranking performance for each query, we apply the normalized discounted cumulated gain (NDCG) [42], a standard measurement in information retrieval. For a given query, the NDCG at position p is calculated as

    NDCG@p = Z_p Σ_{i=1}^{p} (2^{l(i)} − 1) / log(1 + i),    (20)
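In code, (20) reads as follows (l(i) is the relevance level of the i-th ranked image, and Z_p comes from the ideal descending-relevance ordering; the log base is unspecified, so we assume the natural log as written):

```python
import math

def ndcg_at_p(relevances, p):
    """NDCG@p per (20): gain (2^l - 1) discounted by log(1 + rank),
    normalized by the ideal (descending-relevance) ordering so a
    perfect ranking scores 1."""
    def dcg(rels):
        return sum((2 ** l - 1) / math.log(1 + i)
                   for i, l in enumerate(rels[:p], start=1))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg_at_p([2, 1, 0], 3))  # 1.0 (perfect ranking)
```

Any deviation from the ideal ordering, e.g. placing the most relevant image last, drives the score below 1.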

where p is the considered depth, l(i) is the relevance level of the i-th image in the refined ranking list, and Z_p is a normalization constant that makes NDCG@p equal to 1 for a perfect ranking. For the Web Queries dataset, Z_p was calculated from the provided labels. To compute overall performance, NDCG values were averaged over all queries.

Fig. 8 shows the average NDCG scores obtained by the different methods at depths [3, 5, 10, 20, 50]. Both the graph-based method and the click-based method effectively improve on the baseline, indicating that graph-based learning performs well for web image re-ranking. The click-based method, which integrates all visual features with the click data, achieved the best results and outperformed the graph-based method at all depths. This shows that including click data helps bridge the semantic gap and improves image re-ranking performance. Fig. 9 shows the top ten images returned by the three methods for two example queries; green, blue, and red blocks indicate relevance scales 2, 1, and 0, respectively. The first and second rows of Fig. 9(a) and (b) contain many irrelevant and weakly relevant images, whereas the images in the third row are all relevant to the query. This clearly shows that click data is effective in refining the ranking results.

VI. CONCLUSION

In this paper we propose a new multimodal hypergraph learning-based sparse coding method for the click prediction

of images. The obtained sparse codes can be used for image re-ranking by integrating them with a graph-based schema. We adopt a hypergraph to build a group of manifolds, which explore the complementary characteristics of different features through a group of weights. Unlike a graph, in which an edge connects two vertices, a hyperedge in a hypergraph connects a set of vertices, which helps preserve the local smoothness of the constructed sparse codes. An alternating optimization procedure is then performed, through which the weights of the different modalities and the sparse codes are obtained simultaneously. Finally, a voting strategy is used to predict the click from the corresponding sparse code. Experimental results on real-world datasets demonstrate that the proposed method is effective for click prediction, and additional experimental results on image re-ranking suggest that it can improve the results returned by commercial search engines.

REFERENCES

[1] Y. Gao, M. Wang, Z. J. Zha, Q. Tian, Q. Dai, and N. Zhang, "Less is more: Efficient 3D object retrieval with query view selection," IEEE Trans. Multimedia, vol. 13, no. 5, pp. 1007–1018, Oct. 2011.
[2] S. Clinchant, J. M. Renders, and G. Csurka, "Trans-media pseudo relevance feedback methods in multimedia retrieval," in Proc. CLEF, 2007, pp. 1–12.
[3] L. Duan, W. Li, I. W. Tsang, and D. Xu, "Improving web image search by bag-based reranking," IEEE Trans. Image Process., vol. 20, no. 11, pp. 3280–3290, Nov. 2011.
[4] B. Geng, L. Yang, C. Xu, and X. Hua, "Content-aware ranking for visual search," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3400–3407.
[5] B. Carterette and R. Jones, "Evaluating search engines by modeling the relationship between relevance and clicks," in Proc. Adv. Neural Inf. Process. Syst., 2007, pp. 1–9.
[6] G. Dupret and C. Liao, "A model to estimate intrinsic document relevance from the clickthrough logs of a web search engine," in Proc. ACM Int. Conf. Web Search Data Mining, 2010, pp. 181–190.
[7] G. Smith and H. Ashman, "Evaluating implicit judgments from image search interactions," in Proc. WebSci. Soc., 2009, pp. 1–3.
[8] M. Shokouhi, F. Scholer, and A. Turpin, "Investigating the effectiveness of clickthrough data for document reordering," in Proc. Eur. Conf. Adv. Inf. Retr., 2008, pp. 591–595.
[9] V. Jain and M. Varma, "Random walks on the click graph," in Proc. ACM SIGIR Conf. Res. Develop. Inf. Retr., 2007, pp. 239–246.
[10] V. Jain and M. Varma, "Learning to re-rank: Query-dependent image re-ranking using click data," in Proc. Int. Conf. World Wide Web, 2011, pp. 277–286.
[11] E. Candès, J. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Commun. Pure Appl. Math., vol. 59, no. 8, pp. 1207–1223, 2006.
[12] A. Eriksson and A. van den Hengel, "Efficient computation of robust low-rank matrix approximations in the presence of missing data using the l1 norm," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 771–778.
[13] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Non-local sparse models for image restoration," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2009, pp. 2272–2279.
[14] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1794–1801.
[15] S. Gao, I. Tsang, and L. T. Chia, "Kernel sparse representation for image classification and face recognition," in Proc. Eur. Conf. Comput. Vis., 2010, pp. 1–14.
[16] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
[17] C. Wang, S. Yan, L. Zhang, and H. Zhang, "Multi-label sparse coding for automatic image annotation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1643–1650.


[18] D. Donoho, "For most large underdetermined systems of linear equations, the minimal l1-norm solution is also the sparsest solution," Dept. Statist., Stanford Univ., Stanford, CA, USA, Tech. Rep., 2004.
[19] S. Gao, I. Tsang, and L. Chia, "Laplacian sparse coding, hypergraph Laplacian sparse coding, and applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 92–104, Jan. 2013.
[20] C. Snoek, M. Worring, and A. Smeulders, "Early versus late fusion in semantic video analysis," in Proc. ACM Int. Conf. Multimedia, 2005, pp. 399–402.
[21] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," in Proc. Int. Conf. Neural Inf. Process. Syst., 2004, pp. 321–328.
[22] X. He, W. Y. Ma, and H. J. Zhang, "Learning an image manifold for retrieval," in Proc. ACM Int. Conf. Multimedia, 2004, pp. 17–23.
[23] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Comput., vol. 15, no. 6, pp. 1373–1396, 2003.
[24] R. Zass and A. Shashua, "Probabilistic graph and hypergraph matching," in Proc. Int. Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[25] L. Sun, S. Ji, and J. Ye, "Hypergraph spectral learning for multi-label classification," in Proc. Int. Conf. Knowl. Discovery Data Mining, 2008, pp. 668–676.
[26] Y. Huang, Q. Liu, and D. Metaxas, "Video object segmentation by hypergraph cut," in Proc. Int. Conf. Comput. Vis. Pattern Recognit., 2009, pp. 1738–1745.
[27] Z. Tian, T. Hwang, and R. Kuang, "A hypergraph-based learning algorithm for classifying gene expression and array CGH data with prior knowledge," Bioinformatics, vol. 25, no. 21, pp. 2831–2838, 2009.
[28] Y. Huang, Q. Liu, S. Zhang, and D. Metaxas, "Image retrieval via probabilistic hypergraph ranking," in Proc. Int. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 3376–3383.
[29] D. Zhou, J. Huang, and B. Schölkopf, "Learning with hypergraphs: Clustering, classification, and embedding," in Proc. Neural Inf. Process. Syst., 2006, pp. 1601–1608.
[30] M. Wang, X. S. Hua, R. Hong, J. Tang, G. Qi, and Y. Song, "Unified video annotation via multigraph learning," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 5, pp. 733–746, May 2009.
[31] B. Geng, C. Xu, D. Tao, L. Yang, and X. S. Hua, "Ensemble manifold regularization," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 2396–2402.
[32] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in Proc. Adv. Neural Inf. Process. Syst., 2006, pp. 801–808.
[33] M. Wang, X. S. Hua, X. Yuan, Y. Song, and L. R. Dai, "Optimizing multigraph learning: Towards a unified video annotation scheme," in Proc. ACM Int. Conf. Multimedia, 2007, pp. 862–870.
[34] Y. Lv and C. Zhai, "Positional relevance model for pseudo-relevance feedback," in Proc. Int. Conf. Res. Develop. Inf. Retr., 2010, pp. 579–586.
[35] R. Yan, A. Hauptmann, and R. Jin, "Multimedia search with pseudo-relevance feedback," in Proc. Int. Conf. Image Video Retr., 2003, pp. 238–247.
[36] W. Hsu, L. Kennedy, and S. Chang, "Video search reranking via information bottleneck principle," in Proc. ACM Int. Conf. Multimedia, 2006, pp. 35–44.
[37] W. Hsu, L. Kennedy, and S. Chang, "Video search reranking through random walk over document-level context graph," in Proc. ACM Int. Conf. Multimedia, 2007, pp. 971–980.
[38] X. Tian, L. Yang, J. Wang, X. Wu, and X. Hua, "Bayesian visual reranking," IEEE Trans. Multimedia, vol. 13, no. 4, pp. 639–652, Aug. 2010.
[39] T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay, "Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search," ACM Trans. Inf. Syst., vol. 25, no. 2, pp. 1–6, 2007.
[40] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," in Proc. Neural Inf. Process. Syst., 2004, pp. 321–328.
[41] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. R. Statist. Soc. B, vol. 58, no. 1, pp. 267–288, 1996.
[42] K. Jarvelin and J. Kekalainen, "Cumulated gain-based evaluation of IR techniques," ACM Trans. Inf. Syst., vol. 20, no. 4, pp. 422–446, 2002.
[43] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2006, pp. 2169–2178.


[44] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, 2001.
[45] D. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[46] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, "Locality-constrained linear coding for image classification," in Proc. Comput. Vis. Pattern Recognit., 2010, pp. 3360–3367.
[47] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Ann. Statist., vol. 32, no. 2, pp. 407–499, 2004.
[48] S. Perkins and J. Theiler, "Online feature selection using grafting," in Proc. ICML, 2003, pp. 592–599.
[49] S. Chen, D. Donoho, and M. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998.
[50] C. Xu, D. Tao, and C. Xu, "A survey on multi-view learning," Neural Comput. Appl., vol. 23, no. 7–8, pp. 2031–2038, 2013.

Jun Yu (M'13) received the B.Eng. and Ph.D. degrees from Zhejiang University, Zhejiang, China. He is currently a Professor with the School of Computer Science and Technology, Hangzhou Dianzi University, and was previously an Associate Professor with the School of Information Science and Technology, Xiamen University. From 2009 to 2011, he was with Nanyang Technological University, Singapore, and from 2012 to 2013 he was a Visiting Researcher with Microsoft Research Asia. His research interests include multimedia analysis, machine learning, and image processing. He has authored or co-authored more than 50 scientific articles, has co-chaired several special sessions, invited sessions, and workshops, and has served as a program committee member or reviewer for top conferences and prestigious journals. He is a Professional Member of the IEEE, ACM, and CCF.

Yong Rui (F'10) is currently a Senior Director at Microsoft Research Asia, leading research efforts in the areas of multimedia search, knowledge mining, and social and urban computing. A Fellow of the IEEE, IAPR, and SPIE and a Distinguished Scientist of the ACM, he is recognized as a leading expert in his research areas. He holds more than 50 U.S. and international patents. He has authored 16 books and book chapters and more than 100 refereed journal and conference papers. His publications are among the most cited: his top five papers have been cited more than 6000 times, and his h-index is 43. He is the Associate Editor-in-Chief of IEEE MultiMedia magazine, an Associate Editor of the ACM Transactions on Multimedia Computing, Communications and Applications, a founding Editor of the International Journal of Multimedia Information Retrieval, and a founding Associate Editor of IEEE Access. He was an Associate Editor of the IEEE Transactions on Multimedia from 2004 to 2008, the IEEE Transactions on Circuits and Systems for Video Technology from 2006 to 2010, the ACM/Springer Multimedia Systems Journal from 2004 to 2006, and the International Journal of Multimedia Tools and Applications from 2004 to 2006. He also serves on the Advisory Board of the IEEE Transactions on Automation Science and Engineering. He is on the organizing and program committees of numerous conferences, including ACM Multimedia, IEEE Computer Vision and Pattern Recognition, the European Conference on Computer Vision, the Asian Conference on Computer Vision, the IEEE International Conference on Image Processing, the IEEE International Conference on Acoustics, Speech, and Signal Processing, the IEEE International Conference on Multimedia and Expo, SPIE ITCom, the International Conference on Pattern Recognition, and the Conference on Image and Video Retrieval (CIVR).
He is a General Co-Chair of CIVR 2006, ACM Multimedia 2009, and ICIMCS 2010, and a Program Co-Chair of ACM Multimedia 2006, Pacific Rim Multimedia (PCM) 2006, and IEEE ICME 2009. He is/was on the Steering Committees of ACM Multimedia, ACM ICMR, IEEE ICME, and PCM. He is the founding Chair of ACM SIG Multimedia China Chapter. Dr. Rui received the B.S. degree from Southeast University, the M.S. degree from Tsinghua University, and the Ph.D. degree from the University of Illinois at Urbana-Champaign. He also holds a Microsoft Leadership Training Certificate from Wharton Business School, University of Pennsylvania.


Dacheng Tao (M'07–SM'12) is a Professor of Computer Science with the Centre for Quantum Computation and Information Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney. He mainly applies statistics and mathematics to data analysis problems in computer vision, machine learning, multimedia, data mining, and video surveillance. He has authored or co-authored more than 100 scientific articles at top venues, including the IEEE Transactions on Pattern Analysis and Machine Intelligence, the IEEE Transactions on Image Processing, the International Conference on Artificial Intelligence and Statistics, IEEE Computer Vision and Pattern Recognition, and the IEEE International Conference on Data Mining (ICDM), with the best theory/algorithm paper runner-up award at IEEE ICDM'07.


