

Contextual Hashing for Large-Scale Image Search

Zhen Liu, Houqiang Li, Wengang Zhou, Ruizhen Zhao, and Qi Tian, Senior Member, IEEE

Abstract— With the explosive growth of multimedia data on the Web, content-based image search has attracted considerable attention in the multimedia and computer vision communities. The most popular approach is based on the bag-of-visual-words model with invariant local features. Since the spatial context among local features is critical for visual content identification, many methods exploit the geometric clues of local features, including location, scale, and orientation, for explicit post-geometric verification. However, considering the high computational cost of full geometric verification, usually only a few initially top-ranked results are geometrically verified. In this paper, we propose to represent the spatial context of local features as binary codes, and to achieve geometric verification implicitly by efficient comparison of these binary codes. Besides, we explore the multimode property of local features to further boost the retrieval performance. Experiments on the Holidays, Paris, and Oxford building benchmark datasets demonstrate the effectiveness of the proposed algorithm.

Index Terms— Image search, BoVW, hashing, spatial context modeling, geometric verification.

I. INTRODUCTION

WITH the explosive growth of multimedia data on the Web, content-based image search has attracted more and more attention in the multimedia and computer vision communities, owing to its great potential in both industrial applications and research problems [1]–[10], [29]–[41]. Most approaches rely on the Bag-of-Visual-Words (BoVW) model [1], in which an image is represented

by a set of visual words [11], [13]. However, local features are usually high-dimensional. Therefore, to achieve a compact representation, the classic Bag-of-Visual-Words model defines a visual dictionary and quantizes the local features to the corresponding visual words. The visual dictionary can be constructed off-line by unsupervised clustering algorithms, typically k-means [1], hierarchical k-means (HKM) [14], or approximate k-means (AKM) [2]. In this way, an image can be represented by a set of visual words. Further, the inverted file structure, which has been successfully applied in textual information retrieval, is leveraged to index large-scale image databases. However, since it ignores the spatial context among local features, the standard Bag-of-Visual-Words model suffers from limited accuracy [2]. To tackle this problem, many local or global geometric verification methods, such as RANSAC [15], [17], weak geometric consistency [16], and geometric coding [18], [19], have been proposed to check the geometric consistency among matched local features. In fact, these approaches are all based on the initial feature matches obtained by the BoVW method. If the spatial context information is embedded into the indexing structure, we can obtain more accurate initial feature matches and the retrieval performance will be improved. One major concern with such a scheme is that the amount of visual data, which is typically high-dimensional, may be too large to handle under the constraints of memory and retrieval time. Fortunately, similarity-preserving binary codes may serve as a potential solution to this issue.

Recently, the computer vision community has devoted much effort to learning similarity-preserving binary codes for representing large-scale image collections [29]–[38]. Encoding high-dimensional image data into compact binary codes achieves significant gains in storage and computational efficiency. Most works focus on how to transform a high-dimensional feature into a binary code, but few studies address the spatial context information of local features.

In this paper, we propose to binarize the spatial context information. In our approach, for each single feature, its surrounding features are clustered into different groups based on their spatial relationships with the center feature. A toy example is illustrated in Fig. 1. The features located inside the red circle are considered meaningful neighbors of the center feature. The circle is divided into three fan regions by the blue lines originating from the circle center. To make this partition rotation invariant, we divide the circle based on the center feature's dominant orientation.






Those surrounding features located in the same fan region belong to the same group. Based on these grouped surrounding features, we generate a binary code to describe the center feature's spatial context. In the on-line retrieval stage, we compute the Hamming distance between the spatial context binary codes (SPB) of two features if they are quantized to the same visual word. Note that our SPB is different from the Hamming Embedding approach [16]: our method models the spatial structure among single features, while Hamming Embedding targets reducing the quantization loss.

Besides the SPB, we find that the multimode property (MMD) of local features is also useful for image search. The multimode property means that two or more different features are extracted at the same location. It results from the local feature detectors (DoG [11], Hessian affine [12]) and the SIFT descriptor [11]. An example is illustrated in Fig. 2, where the local feature regions are detected by the Hessian affine detector. Each yellow ellipse represents a local feature region, and two multimode feature points are highlighted in the red rectangles. There are more than two multimode features in this picture, but for a better view we illustrate only two of them.

The main contributions of this paper are summarized as follows:
• We represent the spatial context of local features as binary codes for implicit geometric verification.
• We explore the multimode property of local features to improve the image search accuracy.

The rest of the paper is organized as follows. In Section II, related works are reviewed. In Section III, we discuss our contextual hashing scheme in detail. The indexing structure and the searching scheme are introduced in Section IV and Section V, respectively. We present experimental results in Section VI. Finally, we conclude the paper in Section VII.

Fig. 1. An example of the spatial context of five randomly selected features. The green points are local features. The red circle represents the range of meaningful surrounding features of the center feature. Each circle is divided into three fan regions by three blue line segments.

Fig. 2. An example of local features detected by the Hessian affine detector. Each yellow ellipse represents a local feature region. Two multimode feature points are highlighted in the red rectangles.

II. RELATED WORKS

In the past decade, with the introduction of local features, many image search approaches have been proposed based on the popular Bag-of-Visual-Words [1] model. With local features quantized to visual words, images are compactly represented by a "bag" of visual words. Further, by indexing images with the inverted file structure, scalability of image search is achieved. The spatial context information plays an important role in visual content identification. Many approaches [1]–[7], [9], [15], [18], [19], [24], [40] explore the spatial context information to improve the retrieval accuracy. These approaches can be categorized into two classes, i.e., pre-verification and post-verification. Some representative approaches of each category are discussed below.

The motivation of pre-verification approaches is to encode the spatial context of local features into the image representation. In [25], the statistics in the local neighborhoods of local features are used to enhance the discriminative power of visual words. The statistics contain the number of neighborhood features, the average characteristic scale difference, and the average dominant orientation difference between each local feature and its neighborhood features. The feature matches are weighted by the difference of these statistics. In [26], two features are generated at each interest point: one is generated at the dominant scale and the other at the same location but with a larger scale. [7] projects local features of an image along different directions to yield ordered spatial-bag-of-features for image search. Then some heuristic operations are exploited to achieve invariance to translation, rotation, and scale changes. Some works focus on high-order visual phrases [21], [22]. In [22], the geometry-preserving visual phrase is proposed to describe the spatial context of local features, including both the co-occurrences and the long-range spatial layouts of visual words. Actually, it transfers the geometric verification from the post-verification stage to the retrieval stage using the Hough transform.

The post-verification approaches aim to filter out false matches by imposing a spatial consistency constraint. Some approaches focus on local spatial consistency. The local spatial consistency of some spatially nearest neighbors is used in [1] to suppress false visual word matches. "Bundled features" [20] weights the traditional tf-idf [1], [14] by the similarity between feature bundles. The local spatial consistency is measured by projecting feature positions along the horizontal and vertical directions in local MSER regions. However, the bundled feature method is time-consuming, since the spatial verification between bundles is carried out during the retrieval process.



To capture the spatial relationships of all features in the entire image, global geometric verification approaches, such as weak geometric consistency (WGC) [16] and RANSAC [15], are often adopted. WGC uses a weaker global geometric model: the matches with the dominant relative scale difference and relative orientation difference are regarded as true matches, and the other matches are filtered out. RANSAC-based image re-ranking achieves the state-of-the-art result in terms of retrieval accuracy [2]. However, RANSAC has to randomly sample many subsets of matching pairs and perform affine estimation for each subset to obtain the optimal transformation, and is therefore computationally expensive. In practice, it is usually applied only to the top-ranked candidate images to ensure efficiency.

Distinguished from the above methods, in this paper we propose to represent the spatial context information as a binary code. The geometric verification is then implicitly achieved by checking the Hamming distance between the corresponding binary codes, which is very efficient in implementation, and more accurate feature matches can be obtained. With more accurate feature matches, many of the approaches discussed above could benefit.

III. CONTEXTUAL HASHING

In this section, we introduce the details of our algorithm. In Section III-A, we introduce the scheme to transform a feature vector into a binary code. In Section III-B, we discuss how to model the spatial context information. We introduce the multimode property in Section III-C. In Section III-D, we discuss the scheme to integrate the spatial context information and the multimode information into a single binary code. In Section III-E, our self-contained contextual binary code is introduced.

A. Binarization Scheme

We first construct a visual vocabulary with $M$ visual words. Given a feature $f_i$, $Z_{f_i}$ represents its contextual vector; we will introduce our scheme to generate $Z_{f_i}$ in the following sections. A threshold vector $T_k$ is constructed off-line for each visual word by

$$T_k = \frac{\sum_{q(f_i)=k} Z^p_{f_i}}{\sum_{q(f_i)=k} 1}, \quad k = 1, \ldots, M \tag{1}$$

$$Z^p_{f_i} = R \cdot Z_{f_i} \tag{2}$$

in which $R$ is a randomly generated orthogonal matrix used to project $Z_{f_i}$ to its low-dimensional counterpart $Z^p_{f_i}$. $R$ is generated by applying the QR factorization to a randomly drawn matrix with Gaussian entries. Compared with a non-orthogonal matrix [30], the orthogonal matrix reduces the correlation between the dimensions of $Z^p_{f_i}$. $q(\cdot)$ denotes the quantizer mapping a feature to its visual word. Then, given a feature $f_j$, its binary-code vector can be obtained by

$$B_{f_j} = \operatorname{sgn}(Z^p_{f_j} - T_k) \tag{3}$$

where $\operatorname{sgn}(\cdot)$ is a sign indicator function, which is applied to each dimension of the subtraction vector.
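For illustration, Eqs. (1)–(3) can be sketched with a minimal NumPy implementation (written for this presentation, not the authors' released code; the dimensions, function names, and the 64-bit code length are illustrative assumptions):

```python
import numpy as np

def random_orthogonal_projection(dim_in, dim_out, seed=0):
    """R of Eq. (2): QR factorization of a random Gaussian matrix, keeping dim_out rows."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim_in, dim_in)))
    return q[:dim_out, :]

def learn_thresholds(contextual_vectors, word_ids, num_words, R):
    """T_k of Eq. (1): mean projected contextual vector of the features quantized to word k."""
    projected = contextual_vectors @ R.T
    thresholds = np.zeros((num_words, R.shape[0]))
    for k in range(num_words):
        members = projected[word_ids == k]
        if len(members) > 0:
            thresholds[k] = members.mean(axis=0)
    return thresholds

def binarize(contextual_vector, word_id, R, thresholds):
    """B of Eq. (3): sign of the projected vector minus the threshold of its visual word."""
    z_p = R @ contextual_vector
    bits = (z_p - thresholds[word_id] >= 0).astype(np.uint8)
    return np.packbits(bits)   # e.g., 64 bits packed into 8 bytes for the inverted file
```

With a 64-dimensional projection, the packed code occupies 8 bytes per feature, which matches the index entry size discussed in Section VI-E.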

Fig. 3. The illustration of feature surroundings.

B. Spatial Context Information

Let us take feature $A$ in Fig. 3 as an example to illustrate our scheme for modeling the spatial context information. We denote $l_A$ as $A$'s location and $o_A$ as its dominant orientation. A coordinate system is built with $l_A$ as its origin and $o_A$ as its x axis, with which the image plane is divided evenly into four parts denoted by integer IDs. The partition is invariant to rotation changes in that each feature's dominant orientation is used as the x axis. Note that the image plane can be divided into any number of parts. We observe that good results can be obtained when three parts are used: with 120 degrees for each part, the partition has good tolerance to errors in the dominant orientation. The features around $A$ are referred to as its surrounding features, which carry the spatial context information. Under the assumption that truly matched features have similar surrounding features, the matches can be verified by the distance between their feature surrounding descriptors. Considering the limitation of memory and retrieval time for large-scale image search, the descriptor should be efficient to store and the distance between two descriptors should be easy to compute. Inspired by the recent works on hashing high-dimensional data into binary codes [29]–[38], we propose to represent the feature surrounding descriptor with a binary code. We elaborate on our proposed algorithm below.

Let us denote $A$'s surrounding feature set by $\{f_i\}_{i=1}^{N_A}$, in which $N_A$ represents the set cardinality, and compute its descriptor by the weighted sum of their SIFT descriptors:

$$E^{s_k}_A = \sum_{f_i \in S^k_A} w_i \cdot d_i, \quad k = 0, 1, 2, 3, \ldots \tag{4}$$

where $f_i \in S^k_A$ denotes that $f_i$ is located in the $k$th fan region around feature $A$, and $d_i$ and $w_i$ are the SIFT descriptor and the weight of feature $f_i$, respectively. In our experiments, we use

$$w_i = e^{-t \, \|l_i - l_A\|^2 / s_i^2} \tag{5}$$

to assign the weight for feature $f_i$. $s_i$ and $l_i$ denote the scale and the location of $f_i$, respectively. The scale-normalized distance is used to obtain scale invariance. $t$ controls the number of surrounding features included in the descriptor; the impact of $t$ will be discussed in the experiment section. Intuitively, we assign small weights to features far away from $A$ and large weights to features near it. Then, we concatenate all $E^{s_k}_A$ into a long vector:

$$E^s_A = [E^{s_0}_A \; E^{s_1}_A \cdots E^{s_k}_A \cdots] \tag{6}$$

After $L_2$-normalization, $E^s_A$ can be transformed into a binary vector with the binarization scheme of Section III-A. This strategy is denoted as SPB in the following.
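The construction of $E^s_A$ can be sketched as follows (an illustrative sketch written for this presentation, not the authors' code); the feature attributes `xy`, `orientation`, `scale`, and `sift`, the choice of three fan regions, and the default $t = 0.5$ are assumptions based on the description above and the parameter study in Section VI:

```python
import numpy as np

def fan_region_index(center_xy, center_orientation, neighbor_xy, num_regions=3):
    """Fan region of a neighbor, measured relative to the center feature's dominant
    orientation so that the partition is invariant to image rotation."""
    dx, dy = neighbor_xy[0] - center_xy[0], neighbor_xy[1] - center_xy[1]
    angle = (np.arctan2(dy, dx) - center_orientation) % (2 * np.pi)
    return int(angle // (2 * np.pi / num_regions))

def spatial_context_descriptor(center, neighbors, t=0.5, num_regions=3):
    """E^s_A of Eqs. (4)-(6): per-region weighted sums of neighbor SIFT descriptors,
    concatenated and L2-normalized before binarization."""
    dim = len(neighbors[0].sift) if neighbors else 128
    regions = np.zeros((num_regions, dim))
    for f in neighbors:
        k = fan_region_index(center.xy, center.orientation, f.xy, num_regions)
        dist2 = float(np.sum((np.asarray(f.xy) - np.asarray(center.xy)) ** 2))
        w = np.exp(-t * dist2 / (f.scale ** 2))      # Eq. (5): scale-normalized distance
        regions[k] += w * np.asarray(f.sift)         # Eq. (4): weighted sum per fan region
    e_s = regions.reshape(-1)                        # Eq. (6): concatenation
    norm = np.linalg.norm(e_s)
    return e_s / norm if norm > 0 else e_s
```

The resulting vector plays the role of the contextual vector fed to the binarization scheme of Section III-A.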



C. Multimode Property of Features

The multimode property (MMD) refers to the phenomenon that two or more different features are extracted at the same keypoint location. This comes from the algorithms that detect local features and extract the SIFT descriptors [7]. Let us take the Hessian affine detector as an example to explain the former. The image $I(x, y)$ is filtered by Gaussian kernels with different scales, $G(x, y, \sigma)$, resulting in a scale-space representation $f(x, y, \sigma)$:

$$f(x, y, \sigma) = G(x, y, \sigma) * I(x, y) \tag{7}$$

where "$*$" denotes the convolution operation. After applying the Hessian matrix to $f(x, y, \sigma)$, the local extremum points are regarded as local features. For a location $(x_0, y_0)$, there may exist several values of $\sigma$ at which $f(x_0, y_0, \sigma)$ is a local extremum. Besides, even for a given $\sigma$ at a location $(x_0, y_0)$, multiple SIFT descriptors can be extracted. We give a brief explanation of this procedure in the following. After the feature points are detected, an orientation histogram is formed from the gradient orientations of sample points within a normalized region [12]. The highest peak in the histogram is detected and regarded as the dominant orientation, based on which the SIFT descriptor is computed. Usually, any other local peak within eighty percent of the highest peak is also used to create a descriptor. Hence, multiple SIFT descriptors may be computed based on different dominant orientations for a single local feature region. We observe that many images have over 50% of local features with the multimode property on the Holidays and Paris datasets.

Our strategy to use this multimode property is as follows. Given a feature $A$ with the multimode property, we add up the descriptors that share its location:

$$E^m_A = \sum_{l_i = l_A} d_i \tag{8}$$

After $L_2$-normalization, $E^m_A$ can be transformed into a binary vector with the binarization scheme of Section III-A. The local features without the multimode property are assigned binary vectors of zero value.

D. Contextual Binary Code

We combine the spatial context information of Section III-B and the multimode property of Section III-C to generate the contextual binary code:

$$E_A = [E^m_A \;\; E^s_A] \tag{9}$$

where $E^m_A$ represents the multimode descriptor of Eq. (8) and $E^s_A$ is the feature surrounding descriptor of Eq. (6). Then, we use the binarization scheme of Section III-A to transform $E_A$ into a binary vector.
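A minimal sketch of the multimode descriptor and its combination with the spatial context descriptor is given below (written for this presentation, not the authors' code); the feature attributes `xy` and `sift` are the same assumptions as in the previous sketch:

```python
import numpy as np
from collections import defaultdict

def multimode_descriptors(features):
    """E^m of Eq. (8): for each feature, sum the SIFT descriptors of all features that
    share its keypoint location; features without the multimode property get zeros."""
    groups = defaultdict(list)
    for f in features:
        groups[tuple(f.xy)].append(np.asarray(f.sift, dtype=float))
    result = []
    for f in features:
        descs = groups[tuple(f.xy)]
        if len(descs) > 1:                                   # multimode: several descriptors at one location
            v = np.sum(descs, axis=0)
            result.append(v / (np.linalg.norm(v) + 1e-12))   # L2-normalize before binarization
        else:
            result.append(np.zeros_like(descs[0]))           # single-mode feature -> zero vector
    return result

def contextual_vector(e_m, e_s):
    """E_A of Eq. (9): concatenation of the multimode and spatial context descriptors,
    which is then binarized with the scheme of Section III-A."""
    return np.concatenate([e_m, e_s])
```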

Fig. 4. The illustration of the indexing structure with the spatial context binary signature.

E. Self-Contained Contextual Binary Code

The SIFT descriptor has been proved to be informative and robust in many works [2], [11], [12], [16]. In the above discussions, it is not included in our binary code. In this section, we discuss how to include this information in our framework. [16] transforms each feature's SIFT descriptor into a 64-bit binary code, which is denoted by $B^h$. We follow the same strategy, but with the additional contextual binary code generated in Section III-D, which is denoted by $B^c$. When the binary code $B^h$ is used to filter out false matches, some true matches may be filtered out or some false matches may be included, especially for those database features whose Hamming distance to the query feature is around the pre-defined threshold $T^h$. In the following, we refer to these features as the borderline features of $B^h$; they can be represented by $\{ f_j \mid T^h - \alpha < H(B^h_{f_j}, B^h_q) < T^h + \alpha \}$, where $H(B^h_{f_j}, B^h_q)$ is the Hamming distance between the binary codes of feature $f_j$ and the query feature $q$, and $\alpha$ is a scalar value. We use $B^c$ to perform a second filtering on those borderline features. This strategy is denoted as SCB.

IV. INDEXING WITH BINARY CODE

The inverted file structure, which is leveraged from text retrieval, is widely used by many researchers [1]–[10] for scalable indexing. In the traditional inverted file structure, a visual vocabulary is built by clustering randomly selected feature samples, and the database features are quantized to visual words. Those features quantized to the same visual word are regarded as true matches. We adopt a similar strategy with a slight modification, as illustrated in Fig. 4: each visual word is followed by an entry list that contains the image IDs, and the binary code of each feature is added to its entry.
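The indexing structure of Fig. 4 can be sketched as a simple posting-list container (a toy illustration written for this presentation with assumed names; a real index would store the postings in a compact binary layout rather than Python objects):

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted file: one posting list per visual word, where each entry holds
    the image ID together with the feature's 64-bit binary code (a Python int here)."""

    def __init__(self):
        self.postings = defaultdict(list)      # visual word id -> [(image_id, code), ...]

    def add_image(self, image_id, quantized_features):
        """quantized_features: iterable of (visual_word_id, binary_code) pairs for one image."""
        for word_id, code in quantized_features:
            self.postings[word_id].append((image_id, code))

    def candidates(self, word_id):
        """All database entries quantized to the same visual word as a query feature."""
        return self.postings.get(word_id, [])
```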



V. SEARCHING SCHEME

The searching scheme is quite similar to the voting scheme used in many content-based image search works [1]–[10]. We briefly review the voting scheme, as shown in Eq. (10). First, each feature in the query image, $f_i$, is quantized to a visual word $q(f_i)$. Second, for each image that contains the visual word $q(f_i)$, we increment its score by the square of $q(f_i)$'s inverted document frequency (idf). Then the score of each image, $Score_k$, is normalized by its norm to obtain the similarities for ranking:

$$Score_k = \frac{\sum_{f_i \in I^q,\, f_j \in I^d_k} idf^2_{q(f_i)} \cdot \delta_{f_i - f_j}}{\| I^d_k \|^2} \tag{10}$$

$$\delta_{f_i - f_j} = \begin{cases} 1 & \text{if } q(f_i) = q(f_j) \\ 0 & \text{otherwise} \end{cases} \tag{11}$$

where $I^q$ and $I^d_k$ represent the query image and the $k$th database image, respectively, $\delta_{f_i - f_j}$ is a flag variable, which we use to modify this voting strategy to adapt it to the proposed algorithm, and $\| I^d_k \|$ denotes the $L_2$ norm of the $k$th database image's visual word vector. We use the generated binary codes to filter out false matches, and the voting scheme is modified into the following formulation:

$$\delta_{f_i - f_j} = \begin{cases} 1 & \text{if } q(f_i) = q(f_j) \text{ and } H(B_{f_i}, B_{f_j}) < T \\ 0 & \text{otherwise} \end{cases} \tag{12}$$

where $H(B_{f_i}, B_{f_j})$ is the Hamming distance between the binary codes of feature $f_i$ and feature $f_j$, and $\delta_{f_i - f_j}$ denotes whether this Hamming distance is below the pre-defined threshold $T$. The modified voting scheme is expressed by Eqs. (10) and (12).
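The following sketch (illustrative only, building on the toy InvertedIndex above; the idf table, the squared-norm table, and the default threshold of 19 for 64-bit codes reported in Section VI-A are the only values taken from the paper, everything else is an assumption) shows how Eqs. (10) and (12) translate into a scoring loop:

```python
from collections import defaultdict

def hamming(a, b):
    """Hamming distance between two 64-bit binary codes stored as Python ints."""
    return bin(a ^ b).count("1")

def search(index, query_features, idf, image_sq_norms, threshold_T=19):
    """Voting of Eqs. (10) and (12): a database feature votes only if it shares a visual
    word with the query feature and the Hamming distance of their codes is below T."""
    scores = defaultdict(float)
    for word_id, query_code in query_features:              # (visual word, binary code) per query feature
        for image_id, db_code in index.candidates(word_id):
            if hamming(query_code, db_code) < threshold_T:  # Eq. (12): implicit geometric check
                scores[image_id] += idf[word_id] ** 2       # Eq. (10): idf^2 vote
    # Eq. (10): normalize by the squared L2 norm of each database image's visual word vector
    ranked = sorted(((s / image_sq_norms[i], i) for i, s in scores.items()), reverse=True)
    return [(img, sim) for sim, img in ranked]
```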

Fig. 5. The impact of parameter t of Eq. (5). (a) Holidays. (b) Paris.

VI. EXPERIMENT

We test the proposed algorithm on three datasets widely used in the multimedia and computer vision communities: Holidays [23], Paris [28], and Oxford building [2]. The Holidays dataset contains about 1.5K images and 4.5M descriptors. The authors of [23] provide well-trained vocabularies and the extracted local features. The Paris and Oxford building datasets consist of about 6.4K and 5K images collected from Flickr, respectively. We first study the impact of different parameters of the proposed algorithm. Then, to evaluate the performance of large-scale retrieval, we build a distractor dataset by crawling 1M images from the Web. The software released in [23] is used to extract the Hessian affine SIFT [12] features with default parameters. We also adopt the visual vocabulary of [23], which is trained with the k-means algorithm from randomly sampled SIFT features of images downloaded from Flickr. SIFT features are quantized to visual words by an approximate nearest neighbor method with the software of [28]. Mean average precision (mAP) [2] is adopted to evaluate the retrieval performance. Our experiments are run on a server with 32 GB of memory and a 2.4 GHz Intel Xeon CPU.

A. Performance of Spatial Context Binary Code

In this section, we evaluate the performance of our spatial context binary code (SPB) method (algorithm details in Section III-B). To simplify the testing procedure, in the following experiments we use the 20K visual vocabulary provided in [23] to quantize feature descriptors to visual words. Note that under the above settings, the mAP performance of the BoVW approach is 0.451 and 0.322 for the Holidays and Paris datasets, respectively.

Fig. 6. The impact of the length of the spatial context binary code under different thresholds. (a) Holidays. (b) Paris.

The impact of parameter $t$ in Eq. (5) is shown in Fig. 5. When $t^{-1}$ is small, less spatial context information is recorded, and when $t^{-1}$ is too large, more noisy spatial context information is included. As revealed in Fig. 5, we get the best performance when $t$ equals 0.5 for both the Holidays and Paris datasets, and the two curves have quite similar trends. Compared with the baseline approach, we get 0.164 and 0.101 mAP improvement for the Holidays and Paris datasets, respectively.

Fig. 6 illustrates the impact of SPB under different thresholds $T$ of Eq. (12). It can be observed that with more bits used to represent the spatial context, better mAP performance is obtained. However, when 128 bits are used, the best mAP performance is obtained with different thresholds: $T = 42$ for the Holidays dataset and $T = 33$ for the Paris dataset. In our experiments, we select 64 bits to represent the spatial context information, because the best mAP performance is obtained at about $T = 19$ for both the Holidays and Paris datasets.


Fig. 7. The impact of the number of parts into which the image plane is divided, on the Holidays and Paris datasets.


Fig. 9. The performance of our contextual binary code (CB) of Section III-D on the Holidays and Paris datasets.

Fig. 8. The impact of the length of the multimode binary code under different thresholds. (a) Holidays. (b) Paris.

Fig. 10. The impact of α in our self-contained contextual binary code (SCB) on the Holidays and Paris datasets. (a) Holidays. (b) Paris.

The impact of the image plane division is shown in Fig. 7. It can be observed that only minor mAP improvement is obtained when the image plane is divided into more than three parts. As finer division introduces higher computational complexity, the image plane is divided into three parts in the following.

B. Performance of Multimode Binary Code

The multimode phenomenon (MMD) is described in Section III-C. The parameters that affect the performance of our approach are the binary-code length and the threshold $T$ of Eq. (12). Fig. 8 illustrates the impact of MMD under different thresholds. It can be seen that better mAP performance can be obtained with more bits, but less gain is obtained when increasing from 64 bits to 128 bits than from 32 bits to 128 bits. We get about 0.146 and 0.09 mAP improvement compared to the BoVW approach on the Holidays and Paris datasets, respectively.

C. Performance of Contextual Binary Code

As shown in Section VI-A and Section VI-B, the spatial context information and the multimode information are quite useful for improving the retrieval performance. In this section, we illustrate the performance of our strategy of compressing these two kinds of information into a single binary code (more details in Section III-D). From Fig. 9, it can be observed that better mAP performance is obtained with the SPB approach than with the MMD approach. The performance can be improved further when the SPB and the MMD approaches are combined: the mAP of CB is improved by 0.03 and 0.048 compared to the SPB and MMD approaches on Holidays, respectively, and by 0.022 and 0.033 on Paris, respectively. The fact that the multimode property can obtain results similar to the spatial context information also gives evidence of its importance.

D. Performance of Self-Contained Contextual Binary Code

In this section, we demonstrate the impact of $\alpha$ in our self-contained contextual binary code (SCB) of Section III-E. $\alpha$ controls the number of borderline features. As suggested in [16], we set $T^h$ to 20. Fig. 10 shows the impact of $\alpha$ on the Holidays and Paris datasets. It can be observed that as $\alpha$ gets larger, the mAP performance first increases and then drops. When $\alpha$ is small, more emphasis is placed on the feature's SIFT information; when $\alpha$ is large, more emphasis is placed on the feature's contextual information.



TABLE I. The mAP performance of different methods on 3 different datasets. The algorithm details of CB and SCB are in Section III-D and Section III-E, respectively.

In the following, we set $\alpha$ to 3 to get better performance for both the Holidays and Paris datasets.

We also demonstrate the results on another popular dataset, Oxford building [2], in Table I. The performance of some other related works is also given when a similar experimental setting is adopted. The spatial-bag-of-features [7], vocabulary tree based contextual weighting [25], weak geometric consistency [16], and Hamming embedding [16] are denoted as SBoF, VTCW, WGC, and HE, respectively. The results of SBoF and VTCW are taken from the original papers. We implement WGC and HE and obtain results similar to those reported in the original paper. SBoF projects the spatial distribution of visual words in an image onto some pre-defined spatial configurations, and several sub-configurations are selected from a learning set. VTCW uses the statistics of features' scales and orientations around each center feature to weight the matches. Table I shows the comparison results of the above methods and our approach on the three datasets. The mAP performance of our approach is better than or comparable with that of the comparison algorithms.

It should be noted that both our approach and HE [16] focus on a different stage of the BoVW-based image retrieval framework than the other comparison algorithms. In other words, our approach as well as HE [16] are dedicated to improving the accuracy of local feature matching, whereas the other comparison algorithms proceed based on the initial matching results. As the proposed algorithm can get more accurate feature matches, it can further improve the performance of the above algorithms when combined with them. Our approach is most related to the HE approach [16] in that binary codes are generated for feature matching verification. The difference lies in the information used to generate the binary code: HE exploits the SIFT descriptor to generate the binary code, while in our approach the binary signature is obtained based on the contextual information of each local feature. The HE method can also be integrated into our approach to further boost the retrieval accuracy, as demonstrated by our SCB discussed in Section III-E.

E. Performance on Large-Scale Dataset

To evaluate the performance of large-scale image search, the common practice is to employ a large image database as distractors to the ground-truth data [2], [14], [20], [27].

TABLE II. The time and memory cost of each method mentioned above when 1M distractors are added. The algorithm details of MMD, SPB, CB, and SCB are in Section III-C, Section III-B, Section III-D, and Section III-E, respectively.

Fig. 11. The performance of the proposed algorithm with different numbers of distractors added to the ground-truth datasets. (a) Holidays. (b) Paris.

We follow the same scheme, with five different database sizes tested: 50K, 100K, 200K, 500K, and 1M. From Fig. 11, for all the algorithms, it is obvious that the retrieval performance degrades gradually as the size of the distractor set increases. When 1M distractors are added, compared with the BoVW approach, we get 0.166 and 0.063 mAP improvement on the Holidays and Paris datasets, respectively.

In Table II, we present the time cost of our implemented algorithms in the on-line retrieval stage. The pre-processing time cost is the time to generate the binary codes for the HE, MMD, SPB, CB, and SCB approaches. It mainly comes from the computation of feature surrounding descriptors and the dimensionality reduction. As there are fewer surrounding features and lower-dimensional feature surrounding descriptors, the HE and MMD approaches need less time than the SPB, CB, and SCB approaches. The pre-processing can be finished within about 100 milliseconds. As to the query time cost, all binary-code-based approaches are faster than the BoVW approach.


Since it is more efficient to compute the Hamming distance between binary codes than to update a floating-point score, the more false matches are filtered out, the less query time is needed.

The memory cost of the index file for the approaches mentioned in Fig. 11 is shown in Table II. For the BoVW method, 4 bytes are used to record the image ID of each feature. For the WGC approach, we use two additional bytes to record the log scale and the orientation of each feature. For the HE, MMD, SPB, and CB methods, each feature needs 4 bytes to record the image ID and 8 bytes for the 64-bit binary code. For the SCB approach, besides the image ID, we record a 64-bit binary code of the HE method and a 64-bit binary code of the CB method; hence, it needs 20 bytes for each feature.

VII. CONCLUSION

In this paper, we propose an algorithm to represent the spatial context information of local features with a binary code. In addition, we discover that the multimode property is useful for improving the retrieval performance. The experiments on the benchmark Holidays, Paris, and Oxford building datasets verify the effectiveness of the proposed algorithm. Since there are a lot of features detected in a single image, about 3000 features for Holidays images and 4000 features for Paris images, it is memory-prohibitive to extend the proposed algorithm to index billion-scale or larger image databases. In our future work, we will investigate how to reduce the number of features in a single image while preserving the retrieval performance.

REFERENCES

[1] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Dec. 2003, pp. 1470–1477.
[2] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2007, pp. 1–8.
[3] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman, "Total recall: Automatic query expansion with a generative feature model for object retrieval," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2007, pp. 1–8.
[4] W. Zhou, H. Li, Y. Lu, and Q. Tian, "Principal visual word discovery for automatic license plate detection," IEEE Trans. Image Process., vol. 21, no. 9, pp. 4269–4279, Sep. 2012.
[5] W. Zhou, Q. Tian, Y. Lu, L. Yang, and H. Li, "Latent visual context learning for web image applications," Pattern Recognit., vol. 44, no. 10, pp. 2263–2273, 2011.
[6] W. Zhou, M. Yang, H. Li, X. Wang, Y. Lin, and Q. Tian, "Towards codebook-free: Scalable cascaded hashing for mobile image search," IEEE Trans. Multimedia, vol. 16, no. 3, pp. 1–11, Aug. 2014.
[7] Y. Cao, C. Wang, Z. Li, L. Zhang, and L. Zhang, "Spatial-bag-of-features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3352–3359.
[8] S. Zhang, Q. Tian, K. Lu, Q. Huang, and W. Gao, "Edge-SIFT: Discriminative binary descriptor for scalable partial-duplicate mobile search," IEEE Trans. Image Process., vol. 22, no. 7, pp. 2889–2902, Jul. 2013.
[9] S. Zhang, Q. Huang, G. Hua, S. Jiang, W. Gao, and Q. Tian, "Building contextual visual vocabulary for large-scale image applications," in Proc. ACM Int. Conf. Multimedia, 2010, pp. 501–510.
[10] S. Zhang, M. Yang, X. Wang, Y. Lin, and Q. Tian, "Semantic-aware co-indexing for image retrieval," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Apr. 2013, pp. 1–8.
[11] D. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[12] K. Mikolajczyk and C. Schmid, "An affine invariant interest point detector," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2002, pp. 128–142.


[13] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide-baseline stereo from maximally stable extremal regions," Image Vis. Comput., vol. 22, no. 10, pp. 761–767, 2004.
[14] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Feb. 2006, pp. 2161–2168.
[15] M. Fischler and R. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[16] H. Jegou, M. Douze, and C. Schmid, "Hamming embedding and weak geometric consistency for large scale image search," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2008, pp. 304–317.
[17] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Lost in quantization: Improving particular object retrieval in large scale image databases," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jan. 2008, pp. 1–8.
[18] W. Zhou, H. Li, Y. Lu, and Q. Tian, "SIFT match verification by geometric coding for large-scale partial-duplicate web image search," ACM Trans. Multimedia Comput., Commun., Appl., vol. 9, no. 1, pp. 1–4, 2013.
[19] W. Zhou, Y. Lu, H. Li, Y. Song, and Q. Tian, "Spatial coding for large scale partial-duplicate web image search," in Proc. ACM Int. Conf. Multimedia, 2010, pp. 511–520.
[20] Z. Wu, Q. Ke, M. Isard, and J. Sun, "Bundling features for large scale partial-duplicate web image search," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 25–32.
[21] J. Gao, Y. Hu, J. Liu, and R. Yang, "Unsupervised learning of high-order structural semantics from images," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2009, pp. 2122–2129.
[22] Y. Zhang, Z. Jia, and T. Chen, "Image retrieval with geometry-preserving visual phrases," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 809–816.
[23] (2008). Holidays Dataset and Hessian-Affine Detector [Online]. Available: http://lear.inrialpes.fr/people/jegou/data.php
[24] O. Chum, M. Perdoch, and J. Matas, "Geometric min-hashing: Finding a (thick) needle in a haystack," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 17–24.
[25] X. Wang, M. Yang, T. Cour, S. Zhu, K. Yu, and T. Han, "Contextual weighting for vocabulary tree based image retrieval," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 209–216.
[26] Z. Wu, Q. Ke, J. Sun, and H. Shum, "A multi-sample, multi-tree approach to bag-of-words image representation for image retrieval," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2009, pp. 1992–1999.
[27] H. Jegou, M. Douze, and C. Schmid, "On the burstiness of visual elements," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 1169–1176.
[28] (2006). FastANN Code and Paris Dataset [Online]. Available: http://www.robots.ox.ac.uk/~vgg
[29] A. Gionis, P. Indyk, and R. Motwani, "Similarity search in high dimensions via hashing," in Proc. Int. Conf. Very Large Data Bases, 1999, pp. 518–529.
[30] M. Charikar, "Similarity estimation techniques from rounding algorithms," in Proc. ACM Symp. Theory Comput., 2002, pp. 380–388.
[31] B. Kulis and K. Grauman, "Kernelized locality-sensitive hashing for scalable image search," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Oct. 2009, pp. 2130–2137.
[32] Y. Gong and S. Lazebnik, "Iterative quantization: A procrustean approach to learning binary codes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 817–824.
[33] M. Raginsky and S. Lazebnik, "Locality-sensitive binary codes from shift-invariant kernels," Neural Inf. Process. Syst., vol. 22, pp. 1509–1517, Dec. 2009.
[34] A. Torralba, R. Fergus, and Y. Weiss, "Small codes and large image databases for recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2008, pp. 1–8.
[35] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," in Proc. Neural Inf. Process. Syst., 2008, pp. 1753–1760.
[36] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 2564–2571.
[37] W. Zhou, Y. Lu, H. Li, and Q. Tian, "Scalar quantization for large scale image search," in Proc. 20th ACM Int. Conf. Multimedia, 2012, pp. 169–178.
[38] J. Wang, S. Kumar, and S. Chang, "Sequential projection learning for hashing with compact codes," in Proc. Int. Conf. Mach. Learn. (ICML), 2010, pp. 1127–1134.



[39] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, 2001.
[40] Z. Liu, H. Li, W. Zhou, and Q. Tian, "Embedding spatial context information into inverted file for large-scale image retrieval," in Proc. ACM Int. Conf. Multimedia, 2012, pp. 199–208.
[41] L. Zheng, S. Wang, Z. Liu, and Q. Tian, "Lp-norm IDF for large scale image search," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 1626–1633.

Zhen Liu received the B.S. degree in electronic information engineering from the Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China, in 2010, where he is currently pursuing the Ph.D. degree in signal and information processing. His current research interests include image/video processing, multimedia information retrieval, and computer vision.

Houqiang Li received the B.S., M.Eng., and Ph.D. degrees in electronic engineering from the University of Science and Technology of China (USTC) in 1992, 1997, and 2000, respectively. He is currently a Professor with the Department of Electronic Engineering and Information Science, USTC. His current research interests include multimedia search, image/video analysis, and video coding and communication. He has authored or co-authored over 100 papers in journals and conferences. He served as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY from 2010 to 2013, and has been with the Editorial Board of the Journal of Multimedia since 2009. He has served on technical/program committees, organizing committees, and as a Program Co-Chair, a Track or Session Chair for over 10 international conferences. He was the recipient of the Best Paper Award for Visual Communications and Image Processing in 2012, the International Conference on Internet Multimedia Computing and Service in 2012, and the International Conference on Mobile and Ubiquitous Multimedia from ACM in 2011, and a Senior Author of the Best Student Paper of the 5th International Mobile Multimedia Communications Conference (MobiMedia) in 2009.

Wengang Zhou received the B.E. degree in electronic information engineering from Wuhan University, China, in 2006, and the Ph.D. degree in electronic engineering and information science from the University of Science and Technology of China, China, in 2011. He was a Research Intern with the Internet Media Group, Microsoft Research Asia, from 2008 to 2009. From 2011 to 2013, he was a Post-Doctoral Researcher with the Computer Science Department, University of Texas at San Antonio. He is currently an Associate Professor with the Department of Electronic Engineering and Information Science, University of Science and Technology of China. His current research interests include multimedia information retrieval. He received the Best Paper Award from ACM ICIMCS in 2012.

Ruizhen Zhao received the B.S. and Ph.D. degrees in applied mathematics from Xidian University, Xi'an, China, in 1997 and 2001, respectively. He is currently a Professor with the Institute of Information Science, Beijing Jiaotong University. His current research interests include compressed sensing and sparse representation, and their applications to image restoration, video target tracking, and pattern recognition.

Qi Tian (M'96–SM'03) received the B.E. degree in electronic engineering from Tsinghua University, China, in 1992, the M.S. degree in electrical and computer engineering from Drexel University in 1996, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign in 2002. He is currently a Professor with the Department of Computer Science, University of Texas at San Antonio (UTSA). He took a one-year faculty leave with Microsoft Research Asia from 2008 to 2009. Dr. Tian's current research interests include multimedia information retrieval and computer vision. He has published over 220 refereed journal and conference papers. His research projects were funded by NSF, ARO, DHS, SALSI, CIAS, and UTSA, and he received faculty research awards from Google, NEC Laboratories of America, FXPAL, Akiira Media Systems, and HP Laboratories. He received the Best Paper Awards in PCM 2013, MMM 2013, and ICIMCS 2012, the Top 10% Paper Award in MMSP 2011, the Best Student Paper in ICASSP 2006, and the Best Paper Candidate in PCM 2007. He received the 2010 ACM Service Award. He is a Guest Editor of the IEEE TRANSACTIONS ON MULTIMEDIA, the Journal of Computer Vision and Image Understanding, Pattern Recognition Letters, the EURASIP Journal on Advances in Signal Processing, and the Journal of Visual Communication and Image Representation, and serves on the Editorial Board of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, Multimedia Systems Journal, the Journal of Multimedia, and Machine Vision and Applications.
