

Lp-norm IDF for Scalable Image Retrieval

Liang Zheng, Shengjin Wang¹, Member, IEEE, and Qi Tian², Senior Member, IEEE

Abstract—The Inverse Document Frequency (IDF) is prevalently used in Bag-of-Words based image retrieval. The basic idea is to assign less weight to terms with high frequency, and vice versa. However, in the conventional IDF routine, the estimation of visual word frequency is coarse and heuristic, so its effectiveness is largely compromised and far from optimal. To address this problem, this paper introduces a novel IDF family based on the Lp-norm pooling technique. Carefully designed, the proposed IDF takes into account the term frequency, the document frequency, the complexity of images, as well as the codebook information. We further propose a parameter tuning strategy which produces an optimal balance between the TF and pIDF weights, yielding the so-called Lp-norm IDF (pIDF). We show that the conventional IDF is a special case of our generalized version, and that two novel IDFs, i.e., the average IDF and the max IDF, can also be defined from the concept of pIDF. Further, by accounting for the term frequency in each image, the proposed Lp-norm IDF helps to alleviate the visual word burstiness phenomenon. Our method is evaluated through extensive experiments on four benchmark datasets (Oxford 5K, Paris 6K, Holidays, and Ukbench). We show that the Lp-norm IDF works well on large scale databases and when the codebook is trained on irrelevant data. We report an mAP improvement of as large as +13.0% over the baseline TF-IDF approach on a 1M dataset. Moreover, the Lp-norm IDF has a wide application scope, ranging from buildings to general objects and scenes. When combined with post-processing steps, we achieve results competitive with the state-of-the-art methods. In addition, since the Lp-norm IDF is computed offline, no extra computation or memory cost is introduced to the system.

Index Terms—Image retrieval, Lp-norm IDF, burstiness, visual word frequency.

This work was supported by the National High Technology Research and Development Program of China (863 Program) under Grant No. 2012AA011004 and the National Science and Technology Support Program under Grant No. 2013BAK02B04. This work was also supported in part, to Dr. Qi Tian, by ARO grant W911NF-12-1-0057, Faculty Research Awards by NEC Laboratories of America, the 2012 UTSA START-R Research Award, and in part by the National Science Foundation of China (NSFC) under Grant 61128007. L. Zheng and S. Wang are with the State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Electrical Engineering, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]). Q. Tian is with the Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249-1604, USA (e-mail: [email protected]).
¹ To whom correspondence may be addressed: Shengjin Wang
² To whom correspondence may be addressed: Qi Tian

Fig. 1. A toy example of an image collection. Visual words zx and zy both occur in all six images, but with varying TF distributions over the image collection. In the conventional IDF, the IDF weights are equal to zero for both words. But when resorting to TF, zx and zy both have some discriminative power, a problem to be tackled in this paper.

I. INTRODUCTION

With the exponential growth of multimedia data on the web across the world, there is an increasing demand for efficient indexing and retrieval of these data, especially image data. Due to the limitations of methods based on surrounding texts or tags [1], [2], recent years have witnessed rapid progress in the field of Content-Based Image Retrieval (CBIR). To this end, this paper considers the task of large scale image retrieval: given a query image, our goal is to retrieve from a large image database, in real time, all the images containing the same object or scene as the query.

To address this task, a myriad of models [3]–[5] have been proposed in the last decade. Among them, the Bag-of-Words (BoW) based method [6] is the most popular and perhaps the most successful one. The BoW model starts by extracting salient local regions from an image and representing each local patch as a high-dimensional feature vector (e.g., SIFT [7] or its variants [8], [9]). Then the continuous high-dimensional feature space is divided into a discrete space of visual words. This step is achieved by constructing a codebook through unsupervised clustering, e.g., the k-means algorithm; to improve efficiency, approximate k-means (AKM) [3] and hierarchical k-means (HKM) [4] have been proposed. The BoW model then treats each cluster center as a visual word in the codebook.

The BoW model has been successfully applied in text applications [10], [11]. In image retrieval, this model quantizes each local feature to its nearest visual word(s) and represents each image as a sparse histogram of visual words. Finally, images are ranked in real time using various indexing methods [6], [12] and TF-IDF weights.

To measure the importance of visual words, most existing approaches use the TF-IDF (Term Frequency-Inverse Document Frequency) weighting scheme. Term Frequency (TF) is a very simple measure to weight each visual word in an image: each visual word is assumed to have an importance proportional to the number of times it occurs in an image.


While TF concerns term occurrence within an image, Inverse Document Frequency (IDF) [13] concerns term occurrence across a collection of images. It computes a weight for each visual word according to its occurrence frequency across images. Intuitively, IDF assigns less weight to frequent visual words, weakening the impact of these words, and more weight to rare visual words, which are considered more discriminative.

However, the conventional IDF method has two drawbacks. First, the conventional IDF takes as its estimate of visual word frequency the number of images in which the visual word appears. Albeit simple, this strategy operates at the image collection level: it does not take a closer look at the visual word level, where multiple occurrences of a visual word in a single image are often observed. Consequently, the conventional IDF only makes a coarse estimation of visual word frequency. Second, as suggested in [14], IDF weighting does not address the problem of visual word burstiness, which refers to the phenomenon in which certain visual words appear many times in an image due to the existence of repetitive visual elements. In contrast to the conventional wisdom that burstiness mainly exists in man-made objects such as buildings, it actually happens prevalently in images, as shown in Section IV. During image retrieval, if the query image happens to contain a visual word that occurs in bursts in an irrelevant image, this image will get an undesirably high score. As a consequence, burstiness brings about a burst of false matches and compromises image retrieval accuracy (see Fig. 2).

In this paper, we propose a novel IDF formula, called the "Lp-norm IDF", which makes a careful estimation of visual word frequency and achieves significant improvement in retrieval performance. The key idea is that the estimated visual word frequency is a weighted sum of the TF data across the whole database. We show that the conventional IDF is a special case of our generalized IDF family. Meanwhile, two other novel IDFs, termed the average IDF and the max IDF, can be defined with the concept of pIDF. Experimental studies on four image retrieval datasets confirm that, by integrating the term frequency into IDF using the proposed method, image retrieval performance is improved dramatically. When post-processing steps are employed, we are capable of producing results competitive with the state-of-the-art methods. In particular, we further show that the Lp-norm IDF is effective in large scale experiments, where a +13.0% absolute improvement is observed on the Holidays + 1M dataset. Moreover, experimental results reveal that the Lp-norm IDF can be applied not only to the retrieval of man-made objects, such as the buildings in the Oxford and Paris datasets, but also to the general objects and scenes in the Ukbench and Holidays datasets. Last but not least, given an image database, the Lp-norm IDF is computed offline, so no extra computational cost is introduced and efficiency is ensured. Generally speaking, our method is applicable when the database is given offline. When the database changes dynamically, i.e., images are added or removed, we simply re-calculate the IDF scores in a "batch" manner, and the performance can be maintained. A preliminary version of this paper appeared as [15].


Fig. 2. Examples of image matching pairs affected by burstiness. Query images are in the left column. The images on the right are irrelevant images in which visual word burstiness occurs. These bursty visual words lead to multiple matches between the two images, thus compromising image retrieval accuracy.

Compared with [15], extensions are made in four aspects.
• A more detailed description is presented in the "Introduction" and "Related Work" sections.
• The parameter tuning process is re-formulated and discussed in more detail.
• We conduct experiments on a broader range of datasets, demonstrating the wide application scope of our method.
• We extend the current system with various post-processing techniques, producing performance competitive with the state of the art.

The rest of the paper is organized as follows. After a brief review of related work in Section II, we introduce the proposed Lp-norm IDF formula in Section III. Visual word burstiness is illustrated in Section IV. In Section V, we present the experimental results of our method. Finally, conclusions are drawn in Section VI.

II. RELATED WORK

Built on the Bag-of-Words (BoW) model, a large body of work has been proposed to improve its performance. In this section, we briefly review several closely related aspects, i.e., codebook construction, feature quantization, spatial constraint encoding, and visual word weighting.


Codebook Construction. The codebook can be viewed as a partition of the continuous high-dimensional local feature space. Partitioning by a regular lattice [16] is a simple scheme, but is shown to yield poor results in image retrieval [3] due to its data-independent nature. Therefore, unsupervised clustering algorithms such as k-means [6] are typically used to construct visual codebooks. To speed up this process, hierarchical k-means (HKM) [4], approximate k-means (AKM) [3], and their variations [17], [18] are employed. To fill the performance gap when the codebook is trained on irrelevant data, Xie et al. [19] introduce a weighted visual vocabulary on general datasets. To further improve the discriminative power of codebooks, some works also exploit supervised information [20]–[22]. For example, Cai et al. [21] embed semantic information into the codebook by learning multiple sub-dictionaries with specific semantic meanings. Lopez et al. [23] integrate over-segmentation algorithms and spatial grids into the clustering process to narrow the semantic gap between visual words and visual concepts. Zheng et al. [24] design a Bayesian approach to alleviate codebook correlation and achieve effective codebook merging.

Feature Quantization. Typically, hundreds or thousands of local features are detected in an image. These features are quantized to certain visual word(s) in the codebook via approximate nearest neighbor algorithms [3], [4], [9]. The quantization process is prone to a large information loss: a 128-D real-valued SIFT feature is transformed into a single integer. This problem leads to visual word uncertainty and ambiguity [25], [26]. To deal with quantization error, soft quantization [27]–[29] is employed, assigning multiple visual words at the cost of increased query time and memory overhead. Similar in nature, sparse coding [30] calculates a linear combination of visual words to reconstruct the original feature. Constrained quantization [31] neglects features which lie far from the cluster centroids, improving efficiency while maintaining high retrieval accuracy.

Matching Verification. The BoW model treats an image as a bag of coarse visual words, which often produces false positive matches due to their low discriminative power. On one hand, spatial cues have proved effective in various tasks [32]–[35]. In image retrieval, spatial constraints among features are considered [36]–[38]. The geometric context among local features can also be encoded into visual word assemblies [36], [39]–[41]. In [39], supervised approaches are employed, yielding a visual phrase vocabulary. Another promising strategy is to make use of binary features or hash codes [42] to provide complementary cues. For example, Hamming embedding [43] rebuilds the discriminative power of SIFT by providing binary signatures to filter out false matches. Some recent works, e.g., [44], investigate the bit-wise nature of binary features and design codebook-free quantization schemes. Complementary evidence also includes local features [45], [46], global features [47], [48], and multi-level ones [49], [50]. It is shown in [45] that feature fusion is capable of achieving state-of-the-art performance on benchmark datasets.

Visual Word Weighting. The BoW model represents an image as a visual word histogram, each entry of which is assigned a weight. The default weighting method involves the TF-IDF scheme [13].


A variant of TF-IDF is [14], which addresses the intra- and inter-image burstiness problems using a combination of modified TF-IDF formulas. On the other hand, visual word weighting can also be derived from other cues, such as quantization [27], [30], spatial context [51], [52], and semantic information [53], [54]. Quantization-based weighting represents each local feature as a linear combination of visual words, which are then accumulated into the final histogram. Wang et al. [51] propose to incorporate information from both the vocabulary tree and the image spatial domain into a contextual weighting. Su et al. [53] predict a set of semantic attributes for the entire image as well as for local regions, and use these predictions to build the visual word histograms. Our work, in its nature, focuses on a generalization of the TF-IDF weighting scheme: the visual word frequency is re-estimated by Lp-norm pooling in an offline manner, and a parameter is then optimized to achieve a good balance between the TF and IDF weights.

III. PROPOSED APPROACH

This section gives a formal description of the proposed Lp-norm IDF formula. Assume that an image collection possesses N images, denoted as $D = \{I_i\}_{i=1}^{N}$. Each image $I_i$ has a set of keypoints $\{x_j\}_{j=1}^{d_i}$, where $d_i$ is the number of keypoints in $I_i$. Given the codebook $\{z_k\}_{k=1}^{K}$ with a vocabulary size of $K$, image $I_i$ is quantized into a vector representation $v_i = [v_{i,1}, v_{i,2}, \dots, v_{i,K}]^T$, where $v_{i,k}$ stands for the response of visual word $z_k$ in $I_i$.

A. TF-IDF Revisit

The TF-IDF weighting scheme is a classical method in the field of text retrieval and classification. It was adopted in image retrieval by Sivic et al. [6] in 2003. In the TF-IDF formula, the TF part reflects the number of keypoints quantized to a given visual word; as a result, the TF distribution in an image is informative about textures, such as repetitive structures. The IDF part, on the other hand, determines the contribution of a certain visual word: the presence of a less common visual word in an image may be a better discriminator than that of a more common one. The IDF weight of a visual word $z_k$ is defined as

$$\mathrm{IDF}(z_k) = \log \frac{N}{n_k}, \qquad (1)$$

where $N$ denotes the total number of images in the collection, and $n_k$ is the number of images in which $z_k$ occurs. This weight is the Collection Frequency Weight (CFW) introduced by Sparck Jones [13]. In text retrieval, a variety of weighting methods have been proposed, such as Okapi-BM25 [55] and pivoted normalization weighting [56]; typically, these methods focus on the TF part. A comparison with Okapi-BM25 is presented in Table II.

Taking the TF-IDF weighting into account, the similarity score between two images q and d is

$$\mathrm{sim}(q, d) = \frac{\sum_{i=1}^{K} q_i\, d_i\, \mathrm{IDF}(i)^2}{\|q\|\,\|d\|}, \qquad (2)$$


where $\|\cdot\|$ represents the $L_2$-norm and $\|q\|\|d\|$ forms the normalization factor. In Eq. 2, $q_i$ and $d_i$ encode the TF scores of visual word $i$ in the two images, respectively, and $K$ is the codebook size. A comparison between different normalization methods is shown in Section V-C. We use the same similarity function as in [43].

The idea of TF-IDF is also implicitly included in other models, such as the Fisher Vector (FV) [57], [58]. It is shown that, under the Dirichlet compound multinomial model, the Fisher kernel is closely connected with the TF-IDF method [59]. In [58], the GMM model has a similar function: a descriptor which falls into a frequent Gaussian component and which is close to the center is likely to be a background distractor, and is automatically discounted. The performance of FV is shown to be superior to BoW [58] with a compact representation. For the BoW model, however, since the number of visual words is large, it is not tractable to model each visual word as a Gaussian mixture. Moreover, the proximity to cluster centers is typically reflected in weights such as those in Hamming Embedding [43] or soft quantization [27], rather than in IDF. Therefore, this paper concentrates on the TF-IDF method applied to the BoW model. A possible future work would be an exploration of the TF-IDF discount in the FV model.
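For concreteness, the baseline scoring of Eqs. 1 and 2 can be sketched as follows. This is a minimal, dense-matrix illustration rather than the authors' inverted-file implementation; the function and variable names are ours.

```python
import numpy as np

def conventional_idf(tf):
    """tf: (N, K) matrix with tf[i, k] = occurrences of visual word k in image i (Eq. 1)."""
    N = tf.shape[0]
    n_k = np.count_nonzero(tf, axis=0)            # number of images containing each word
    return np.log(N / np.maximum(n_k, 1))         # guard against words absent from the database

def similarity(q_tf, d_tf, idf):
    """Eq. 2: IDF^2-weighted dot product, normalized by the L2-norms of the TF histograms."""
    num = float(np.sum(q_tf * d_tf * idf ** 2))
    return num / (np.linalg.norm(q_tf) * np.linalg.norm(d_tf) + 1e-12)
```

In practice, the same scores are accumulated through the inverted file, visiting only the visual words that occur in the query.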

B. Lp-norm IDF

For a given visual word $z_k$, its IDF score is negatively correlated with its visual word frequency $u_k$ (whether this visual word is common or rare in the database). The conventional IDF treats $u_k$ as the number of images possessing $z_k$, which is denoted as $n_k$ in this paper. Although this strategy agrees with the basic idea, the estimation of $u_k$ is coarse. In the extreme case illustrated in Fig. 1, visual words $z_x$ and $z_y$ appear in all the images $I_1$ through $I_6$. According to Eq. 1, both IDF($z_x$) and IDF($z_y$) are equal to zero, indicating that $z_x$ and $z_y$ are totally worthless for image retrieval. However, if we consider the fact that the frequency distributions of $z_x$ and $z_y$ over the entire image collection are quite different, we may realize that these visual words indeed possess some discriminative power, which is ignored by the conventional IDF formula. Therefore, we seek to augment the IDF formula with the TF distribution, a process featured by Lp-norm pooling.

Specifically, assume that an image collection $D$ consists of $N$ images, $n_k$ of which contain visual word $z_k$. We denote the image set containing $z_k$ as $P_k = \{I \in D \mid z_k \in I\}$, with $|P_k| = n_k$. From the quantized images $\{v_i\}_{i=1}^{N}$, we seek to estimate the frequency $u_k$ of visual word $z_k$. The conventional IDF treats $u_k = |P_k| = n_k$. Our method, instead, employs Lp-norm pooling [60] to perform the estimation, which can be formulated as a weighted sum of the TF data across the N database images, i.e.,

$$u_k = \sum_{i=1}^{N} w_{i,k}\, v_{i,k}^{p} = \sum_{I_i \in P_k} w_{i,k}\, v_{i,k}^{p}, \quad p \geq 0. \qquad (3)$$

Fig. 3. Illustration of four different IDF schemes. An image collection consists of N images indexed from 1 to N, and the term frequencies of word $z_k$ in each image are depicted below the images. The formulas demonstrate how the estimated word frequency of $z_k$ is calculated for the four IDFs, i.e., the conventional IDF, the average IDF, the max IDF, and the Lp-norm IDF introduced in this paper.

Built upon this adjusted estimation of the visual word frequency $u_k$, our framework is presented as follows:

$$\mathrm{pIDF}(z_k) = \log \frac{N}{u_k} = \log \frac{N}{\sum_{I_i \in P_k} w_{i,k}\, v_{i,k}^{p}}, \quad p \geq 0, \qquad (4)$$

where $v_{i,k}$ denotes the number of occurrences of $z_k$ in image $I_i$. The estimated frequency $u_k$ is the weighted sum of the individual TF data $v_{i,k}$. The parameter $p$ determines the extent to which the term frequency contributes to the estimated value. The coefficient $w_{i,k}$ reflects the contribution of each image containing $z_k$ to the frequency estimation. Therefore, $w_{i,k}$ should encode the following properties.

First, images vary a lot in length, i.e., the number of visual words they contain. Images with a greater length tend to contain more instances of $z_k$; put another way, it is more probable that $z_k$ appears in large images. In this case, we should raise its estimated frequency, thus lowering its IDF score. Consequently, $w_{i,k}$ should be positively correlated with the image length $d_i$. For numerical reasons, it is appropriate to normalize the image length by the average value $\bar{d}$, which ensures that an image of average length keeps the same weight after image length normalization.

Next, we seek to incorporate codebook information into $w_{i,k}$. Given that $u_k$ is larger for a smaller codebook, another normalization should be considered. We propose to normalize $w_{i,k}$ by the average value of $v_{i,k}$, in the form of $\log(1 + \frac{1}{n_k}\sum_{I_i \in P_k} v_{i,k})$. In addition, for practical implementation, the IDF weights of visual words should be non-negative (each visual word, no matter how often it appears in bursts, should at least have some discriminative power). Taking the aforementioned considerations into account, the Lp-norm IDF is finally formulated as

$$\mathrm{pIDF}(z_k) = \log\!\left(1 + \frac{N}{\sum_{I_i \in P_k} w_{i,k}\, v_{i,k}^{p}}\right), \quad p \geq 0, \quad w_{i,k} = \frac{d_i/\bar{d}}{\log\!\left(1 + \frac{1}{n_k}\sum_{I_i \in P_k} v_{i,k}\right)}. \qquad (5)$$

In Eq. 5, $d_i$ and $\bar{d}$ denote the number of features in image $I_i$ and the average number of features for images in the database, respectively.
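A minimal sketch of Eq. 5, written against a dense term-frequency matrix purely for clarity (the paper computes these quantities offline from the quantized database); the names, the numerical guards, and the toy layout are ours.

```python
import numpy as np

def lp_norm_idf(tf, p=3.5):
    """tf: (N, K) term-frequency matrix; returns a (K,) vector of pIDF weights (Eq. 5)."""
    tf = np.asarray(tf, dtype=float)
    N = tf.shape[0]
    present = tf > 0                                           # membership of images in P_k
    n_k = np.maximum(present.sum(axis=0), 1)                   # document frequency n_k
    d = tf.sum(axis=1, keepdims=True)                          # image lengths d_i
    mean_tf = tf.sum(axis=0) / n_k                             # (1/n_k) * sum_{I_i in P_k} v_{i,k}
    w = (d / d.mean()) / np.maximum(np.log1p(mean_tf), 1e-12)  # w_{i,k} = (d_i / d_bar) / log(1 + mean TF)
    u = (w * tf ** p * present).sum(axis=0)                    # u_k = sum_{I_i in P_k} w_{i,k} v_{i,k}^p
    return np.log1p(N / np.maximum(u, 1e-12))                  # pIDF(z_k) = log(1 + N / u_k)
```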


C. Average IDF and Max IDF

First, we show how the conventional IDF is derived from our generalized IDF family. If we set $w_{i,k} = 1$ and $p = 0$, Eq. 4 simply reduces to the conventional IDF representation in Eq. 1; therefore, the conventional IDF is a special case of the Lp-norm IDF. Moreover, from Eq. 4 we can further derive two novel IDF variants, i.e., the average IDF and the max IDF. If we set $w_{i,k} = 1$ and $p = 1$, the average IDF is defined as

$$\mathrm{aIDF}(z_k) = \log \frac{N}{u_k} = \log \frac{N}{\sum_{I_i \in P_k} v_{i,k}}, \qquad (6)$$

where $u_k$ is approximated by the $L_1$-norm of $v_{i,k}$ $(i = 1, \cdots, n_k)$. Then, if we consider the $L_\infty$-norm when $w_{i,k} = 1$, the max IDF can be derived as well:

$$\mathrm{mIDF}(z_k) = \log \frac{N}{u_k} = \log \frac{N}{\max_i v_{i,k}}. \qquad (7)$$
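As a toy illustration (the term frequencies and collection size below are made up), the three pooled estimates and the resulting IDF values can be computed directly:

```python
import numpy as np

v_k = np.array([12.0, 2.0, 1.0, 5.0])   # TFs of word z_k in the n_k images that contain it
N = 100                                  # total number of images in the toy collection

cvt_idf = np.log(N / len(v_k))           # Eq. 1: u_k = n_k             (w = 1, p = 0)
avg_idf = np.log(N / v_k.sum())          # Eq. 6: u_k = L1-norm of TFs  (w = 1, p = 1)
max_idf = np.log(N / v_k.max())          # Eq. 7: u_k = max TF          (L-infinity pooling)
print(cvt_idf, avg_idf, max_idf)         # the bursty word is punished more by aIDF and mIDF
```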

These two IDF variants correspond to average pooling and max pooling, respectively. An example of how the four different IDFs discussed above are calculated is presented in Fig. 3. The pooling technique used here differs from feature pooling in two aspects. First, the subject of the pooling here is the bag-of-words image representation $v_i$, instead of a subregion of a partitioned image as in SPM [61]. Second, pooling is used here to aggregate the response of the whole image collection into the accumulated frequency of each visual word, while in feature pooling the result is the response of an image to each visual word.

D. A Parameter Tuning Strategy for the Lp-norm IDF

Basically, to determine the optimal value of $p$ in Eq. 5, one would plot a p-mAP curve for parameter tuning. For general usage, however, a more principled procedure is desirable. To the best of our knowledge, previous literature offers few, if any, references for this task. Therefore, in this paper we propose a parameter tuning strategy for the optimization process. Then, using the p-mAP curves, we demonstrate the trade-offs in estimating the $p$ value for a given system.

Specifically, our idea of parameter tuning is as follows. We seek to minimize a cost function concerning visual word discriminative power; in a nutshell, we aim at equalizing the discriminative power of different visual words. In fact, the TF-IDF weight encodes the importance of a visual word in separating one image from the others, i.e., its discriminative power, and we simply take the product of the TF and IDF values as the measurement. For the Lp-norm IDF, TF and pIDF are negatively correlated: generally, pIDF punishes large TF and favors small TF. In other words, the Lp-norm IDF aims at achieving a balance between the two weighting factors; in this manner, burstiness (high TF) is punished. In the literature, TF suppression can be performed by $\sqrt{(\cdot)}$ or $\log(\cdot)$ operators [14], [55]. In [62], only a one-to-one matching scheme is allowed, so the TF value is at most 1. Our work, instead, employs a global weighting factor, i.e., the Lp-norm IDF, to eliminate the TF differences. By tuning the parameter $p$, different visual words come to possess similar TF·pIDF values, so that the burstiness problem is correspondingly addressed. More specifically, the objective is to minimize the diversity of discriminative power among visual words, namely,

$$\arg\min_{p} \ \mathrm{var}_k \Big\{ \frac{1}{n_k} \sum_{I_i \in P_k} v_{i,k} \cdot \mathrm{pIDF}(z_k) \Big\}, \qquad (8)$$

where the variance operator characterizes the diversity of discriminative power among visual words, and the discriminative power of a visual word is described by its average TF·pIDF value. Eq. 8 aims to optimally balance the relationship between the TF and IDF weights, thus minimizing the diversity of discriminative power among the visual words in the codebook.

The parameter tuning strategy in Eq. 8 does not have a closed-form solution for $p$. Therefore, we adopt a greedy search on the Flickr 1M dataset to obtain the optimal value of $p$. Note that a non-trivial global solution of Eq. 8 is $p = \infty$, but in that case we find that none of the visual words are "useful", which is clearly undesirable. With this consideration, and according to our prior knowledge, we restrict the search scope of $p$ to 1-6, from which a valid $p$ value is selected. The result is demonstrated in Section V-C.

IV. VISUAL WORD BURSTINESS

In text retrieval, the term positive adaptation, or burstiness, refers to the phenomenon in which words tend to appear in bursts, i.e., once they appear in a document, they are more likely to appear again. In the image retrieval community, burstiness often describes the phenomenon that repetitive structures are present (see the first row in Fig. 4), so that certain visual words occur many times in a single image. Consider the scenario in which a query image contains a visual word that appears in bursts in an irrelevant image. In this case, this irrelevant image will get a high similarity score, or even a high rank. Due to the prevalent existence of burstiness, the retrieval performance can be compromised to a large extent.

In [14], intra-image and inter-image burstiness are discussed. Our work differs in that we leverage the term frequencies across the database, followed by an optimization of the IDF weight. Another difference lies in that [14] penalizes burstiness by computing a normalization factor on-the-fly, while our method assigns weights at the visual word level and in an offline manner. A comparison of our method and [14] is shown in Section V-D.

To analyze the burstiness phenomenon, we plot the visual word distribution in Fig. 5. For each visual word $k$ defined in the codebook, its term frequency (TF) in image $i$ is denoted as $v_{i,k}$. We first count its maximum TF across the image collection, i.e., $\max_i v_{i,k}$; the histogram of $\max_i v_{i,k}$ $(k = 1, ..., K$, where $K$ is the codebook size) is shown in Fig. 5(a). Then, we denote the maximum term frequency in image $i$ as $N_i$, and count the number of images that have a certain value of $N_i$, forming the histogram shown in Fig. 5(b).
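The statistics behind Fig. 5 can be gathered in a few lines; the following is an illustrative sketch (dense matrix, our own names), not the experimental code itself.

```python
import numpy as np

def burstiness_histograms(tf):
    """tf: (N, K) term-frequency matrix.
    Returns (a) counts of visual words grouped by their maximum TF over the collection,
    and (b) counts of images grouped by their maximum TF within the image."""
    tf = np.asarray(tf)
    max_tf_per_word = tf.max(axis=0).astype(int)    # Fig. 5(a): max_i v_{i,k} for each word
    max_tf_per_image = tf.max(axis=1).astype(int)   # Fig. 5(b): N_i for each image
    return np.bincount(max_tf_per_word), np.bincount(max_tf_per_image)
```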


Fig. 4. Visual word burstiness and the impact of the Lp-norm IDF. (Top): the burstiness phenomenon. The same markers represent the same visual words. Note the repetition in these images. (Bottom): the impact of the Lp-norm IDF. We plot two markers for each keypoint: red and green markers denote the Lp-norm IDF and the conventional IDF, respectively. The area of the markers encodes the amplitude of the weights. Note that visual words in repetitive structures (burstiness) are heavily down-weighted, while the weights of discriminative structures are preserved. A more detailed description is provided in the text.


Fig. 5. (a): Histogram of visual words for different values of maximum term frequency; (b): Histogram of images for different values of maximum term frequency. The data is evaluated over the Flickr 1M dataset, and the codebook size is 1M.

Fig. 5(a) suggests that a majority of visual words occur at most 2 or 3 times in an image. On the other hand, Fig. 5(b) shows that most images have a maximum term frequency of 5 or 6. This phenomenon is all the more notable if we consider the fact that an image typically has hundreds of visual words while the codebook size is very large, say, 1 million, meaning that under an independence assumption multiple occurrences of the same visual word in an image would be unlikely. Meanwhile, note that these statistics are collected from general images, not confined to man-made objects such as buildings. Therefore, the burstiness phenomenon (see the first row of Fig. 4) widely exists in image retrieval settings. The analysis above reveals the importance of suppressing the weights of the visual words that tend to appear in bursts, a problem that can be effectively addressed using the proposed Lp-norm IDF.

The second row of Fig. 4 depicts the impact of the proposed Lp-norm IDF on burstiness. The green and red markers are co-located with each keypoint and correspond to the conventional IDF and the Lp-norm IDF weights, respectively. The size of the markers is proportional to the IDF value. For each keypoint, a small red marker (in the centre of a large green marker) indicates that the visual word is heavily down-weighted. For keypoints which are also displayed in the images in the first row (these keypoints are part of repetitive structures), the red markers are typically very small; therefore, keypoints in bursts receive a lower weight. On the other hand, large red markers denote slightly, if at all, down-weighted visual words, which correspond to quite discriminative structures; these keypoints are typically not displayed in the images in the first row. In Fig. 4, it is obvious that the more elaborate structures, such as people and discriminative shapes, are retained as before (and receive a large red marker), whereas the repetitive structures, involving man-made constructions and structured background, are heavily punished (and receive a small red marker). As a result, the Lp-norm IDF punishes visual word burstiness while preserving the discriminative structures.


Fig. 6. Selection of the parameter p. On the Flickr 1M dataset, we plot the value of Eq. 8 against different values of p. Equation 8 achieves its minimum when p is about 3.5.

Fig. 7. mAP results against different values of p and different MA on the Oxford and Paris datasets. We use Multiple Assignment (MA) [43], which assigns multiple visual words to a descriptor, and we denote the number of quantized visual words as MA. Five curves are plotted, corresponding to MA = 1, 2, ..., 5. The dashed line shows the baseline result. It is evident from these results that when p takes an appropriate value, our method is consistently higher than the baseline. The maximum is achieved when p is about 3-4. MA is set to 3 for effectiveness and efficiency considerations.

V. EXPERIMENTS

We evaluate the performance of the proposed Lp-norm IDF method on the large scale image retrieval task. Experiments are conducted on four benchmark datasets populated with 1M distractor images. In this section, the experimental results are summarized and analysed.

A. Datasets

To evaluate the effectiveness of the Lp-norm IDF, we conduct experiments on five publicly available datasets: Oxford 5K [3], Paris 6K [63], Holidays [43], Ukbench [4], and Flickr 1M [43].

Oxford 5K. The Oxford 5K dataset was collected from Flickr and contains 5062 images in total. It provides a comprehensive ground truth for 11 distinct landmarks, each with 5 queries, giving 55 query images in total.

Paris 6K. The Paris 6K dataset was generated in coupling with Oxford 5K. It contains 6385 high resolution images obtained from Flickr by queries of Paris landmarks. Again, the Paris dataset features 55 queries of 11 different landmarks.

Holidays. The Holidays dataset is composed of 500 queries from 1491 scene images. For each query, a few (mostly 1 or 2) ground truth images are annotated. For each query in the Oxford 5K, Paris 6K, and Holidays datasets, we use Average Precision (AP), calculated as the area under the precision-recall curve, to evaluate retrieval performance. The APs of all query images are then averaged, yielding the mean Average Precision (mAP), which is employed to evaluate retrieval accuracy on each dataset.

Ukbench. The Ukbench dataset contains 10200 images in 2550 groups. Each group has 4 images of the same object taken from different views. In this dataset, each image serves as a query image, so there are 10200 queries. Performance is measured by the recall of the top-4 candidate images, referred to as the N-S score (maximum 4).

Flickr 1M. The Flickr 1M dataset includes 1 million distractor images arbitrarily retrieved from Flickr. These images are added into the Oxford 5K, Paris 6K, Holidays, and Ukbench datasets to test the scalability of our approach.

B. Baseline

In this paper, we adopt the image retrieval procedure proposed in [3] as the baseline approach. During preprocessing, we extract salient keypoints in the images, from which 128-dimensional SIFT descriptors are computed. We have also implemented the RootSIFT [9] variant (see Table III and Table IV), due to its effectiveness under the Euclidean distance. Then, from a pool of SIFT features, a codebook is constructed by the approximate k-means (AKM) method [3], and an inverted file is built which indexes the database images and allows efficient access. For online retrieval, the SIFT features of the query image are quantized using the approximate nearest neighbor (ANN) indexing structure. For each query visual word, candidate images are found from the corresponding entry in the inverted file, and scores for these candidate images are calculated using the conventional TF-IDF in Eq. 2. We do not apply the stop word technique. When L2 normalization is used, the L2-norm of an image is calculated using the TF histogram without IDF consideration.

For the Oxford 5K and Paris 6K datasets, unless otherwise specified, the Hessian-Affine detector [3] is employed. In our implementation, we only allow a one-to-one mapping between SIFT descriptors and Hessian-Affine regions. This modification reduces the false matches brought by multiple SIFTs per location, producing a higher baseline result compared with [3]. We also report results obtained with the DoG-Affine detector in Table III, in which the scaling factor is set to 12.5 as in [64]. For the Holidays and Ukbench datasets, we use the SIFT descriptors provided on their websites.

In addition, Multiple Assignment (MA) [43] is employed, in which a descriptor is quantized to several visual words, e.g., $\{z_{k1}, z_{k2}, z_{k3}, ...\}$; we denote the number of quantized visual words as MA. Moreover, a weighted version of MA, i.e., Soft Quantization (SQ) [27], is also tested. The difference between MA and SQ is that the latter assigns a weight to each quantized visual word. The weight takes the form of $\exp(-\frac{d^2}{\sigma^2})$, where $d$ denotes the Euclidean distance between the visual word and the descriptor, and $\sigma$ is a weighting parameter set to 0.01 in our experiment.
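A minimal sketch of the MA / SQ assignment just described; the exhaustive nearest-neighbor search below merely stands in for the ANN structure actually used, and the function name is ours.

```python
import numpy as np

def soft_assign(descriptor, centroids, ma=3, sigma=0.01):
    """Return the indices of the MA nearest visual words and their SQ weights
    exp(-d^2 / sigma^2); with uniform weights this degenerates to plain MA."""
    d2 = np.sum((centroids - descriptor) ** 2, axis=1)   # squared distances to all visual words
    idx = np.argsort(d2)[:ma]                            # the MA nearest visual words
    weights = np.exp(-d2[idx] / sigma ** 2)              # SQ weight for each assignment
    return idx, weights
```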


C. Parameter Analysis

TABLE I
mAP OF DIFFERENT NORMALIZATION METHODS (Oxford 5K + Flickr 1M)

Normalization   TF-IDF   TF-pIDF
No              0.432    0.629
L1              0.192    0.208
L2              0.523    0.626

Fig. 8. Image retrieval performance as a function of the codebook size for different weighting schemes, i.e., the TF-IDF baseline, TF-avgIDF, TF-maxIDF, and TF-pIDF. Mean Average Precision (mAP) for (a) Oxford 5K, (b) Paris 6K, (c) Holidays, and the N-S score for (d) Ukbench are presented. It is evident from these results that the proposed Lp-norm IDF outperforms the other TF-IDF variants at all codebook sizes.

For the Lp-norm IDF formula defined in Eq. 5, the parameter p determines the extent to which burstiness is punished. It is evident from Eq. 5 that a larger value of p implies amplified punishment of visual word burstiness, and an optimal value helps produce satisfying performance. First, we search for the p that minimizes the cost function in Eq. 8 on the Flickr 1M dataset. The results are illustrated in Fig. 6. We can see that Eq. 8 achieves its minimum value when p is about 3.5; moreover, the curve in Fig. 6 is stable when p is around 3-4.

Then, we vary the value of p and test the performance on the Oxford and Paris datasets. The mAP results are presented in Fig. 7. Note that when MA is employed, the IDF score with respect to each quantized visual word is used. It is clear from these results that when p is not too large (below about 5), the curves of our method are consistently higher than the TF-IDF baseline. For MA, we observe a superior performance of MA = 3 in Fig. 7(a) and MA = 5 in Fig. 7(b). Nevertheless, since MA = 3 works almost equally well for the Paris dataset, for efficiency considerations we set MA = 3 for all datasets in the following experiments. Furthermore, in Fig. 7, the mAP first rises with p and then drops after reaching the peak. For different MA values, the peak is somewhere around p = 3.5. Similar to Fig. 6, for p between 3 and 4, the performance on the two datasets remains stable.

Note that the results in Fig. 6 and Fig. 7 are to some extent consistent, which shows that Eq. 8 is effective in predicting the p value. We speculate that the reason lies in that Eq. 8 equalizes the importance of the visual words: some visual words with low IDF that do not appear in bursts are "activated"; those appearing in bursts are de-emphasized; and those with large IDF scores are preserved, as they do not appear in bursts. In this sense, we provide a possible solution to this sort of optimization problem by making a larger number of visual words count. We also find that there exists some inconsistency between the p value at the minimum in Fig. 6 and at the maximum in Fig. 7. In fact, for the image retrieval system as a whole, there exists considerable noise affecting such a local optimization, such as codebook bias (a sub-optimal partitioning of the feature space), quantization inaccuracy (using approximate nearest neighbor search), etc. These factors may cause the deviations between Fig. 6 and Fig. 7. In spite of this, in the following experiments we set p to 3.5 for all the datasets.
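The greedy search over p (Section III-D, Eq. 8) might be sketched as follows, reusing the lp_norm_idf() sketch from Section III-B; the candidate grid and names are illustrative, not the exact experimental procedure.

```python
import numpy as np

def tune_p(tf, candidates=np.arange(1.0, 6.01, 0.5)):
    """Pick the p in [1, 6] that minimizes Eq. 8: the variance, over visual words,
    of the average TF * pIDF value."""
    tf = np.asarray(tf, dtype=float)
    n_k = np.maximum(np.count_nonzero(tf, axis=0), 1)
    mean_tf = tf.sum(axis=0) / n_k                    # (1/n_k) * sum_{I_i in P_k} v_{i,k}
    costs = [np.var(mean_tf * lp_norm_idf(tf, p)) for p in candidates]
    return float(candidates[int(np.argmin(costs))])
```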

In fact, for general applications, p can be set to 3-4 without much noticeable difference in image retrieval performance.

Table I shows the results of using different normalization strategies in Eq. 2. L1 normalization measures the rate of descriptor matches; for a small query region in a large image, L1 normalization will probably fail, so it produces a low baseline result. On the other hand, no normalization amounts to directly counting the number of descriptor matches, where images of greater length (i.e., with more visual words) tend to produce more matches. L2 normalization is a compromise between the above two methods and produces the highest baseline result. Therefore, although in Table I the Lp-norm IDF achieves a +18.3% improvement (from 0.432 to 0.629) in the case of no normalization, we choose L2 normalization in the following experiments.

D. Evaluation

TABLE II
IMAGE RETRIEVAL ACCURACY FOR DIFFERENT TF-IDF WEIGHTING METHODS ON BENCHMARK DATASETS

Method                      Oxford, mAP     Paris, mAP      Holidays, mAP   Ukbench, N-S
                            5K      1M      6K      1M      1K      1M      10K     1M
TF-IDF                      0.685   0.523   0.531   0.404   0.604   0.391   3.393   3.050
TF-avgIDF                   0.690   0.540   0.542   0.426   0.624   0.435   3.418   3.094
TF-maxIDF                   0.691   0.423   0.544   0.298   0.601   0.320   3.408   2.865
Okapi-BM25 [55]             0.692   0.579   0.562   0.486   0.640   0.504   3.417   3.209
Jégou et al. [14]           0.695   0.558   0.565   0.491   0.610   0.512   3.447   3.204
TF-pIDF                     0.696   0.626   0.562   0.513   0.642   0.521   3.454   3.267
Jégou et al. [14] + pIDF    0.701   0.640   0.573   0.531   0.641   0.530   3.476   3.305

Comparison of four IDFs. We discussed four IDFs in Section III, i.e., the conventional IDF, the average IDF (avgIDF), the max IDF (maxIDF), and the Lp-norm IDF (pIDF). The last three are defined for the first time by our pooling method. Fig. 8 and Table II compare the retrieval accuracy of the four IDFs. Results on the four benchmark datasets are reported, which leads to three major observations.

First, from Table II, max IDF is inferior to the baseline in the large scale experiments: on the four datasets, an mAP drop of 10.0%, 10.6%, 7.1% and an N-S drop of 0.18 is observed, respectively. Max IDF takes the maximum value of a word's TF as the estimate of its frequency, so it neglects the document frequency, while the conventional IDF neglects the TF. On the 1M datasets, where the number of images is large, this limitation is amplified. Consequently, on small datasets max IDF shows performance similar to the baseline: it slightly outperforms the baseline on the Oxford and Paris datasets, but is slightly inferior on the Holidays and Ukbench datasets; on large scale datasets, however, the situation is reversed.

Second, Fig. 8 and Table II show that, on each of the four datasets, the average IDF is slightly superior to both the conventional IDF and the max IDF. The average IDF improves mAP by +1.7%, +2.2%, +4.4% and N-S by +0.04 on the four 1M datasets, respectively. In its nature, the average IDF explicitly considers both the TF of visual words in each image and the document frequency, though in a very simple additive manner. Therefore, on both small and large datasets, the average IDF gives better performance.

Finally, it is evident from Fig. 8 and Table II that our proposed Lp-norm IDF consistently outperforms the other three IDFs. The Lp-norm IDF estimates the word frequency using both the term frequency and the document frequency of each visual word. By carefully weighting the contribution of every database image and optimizing the parameter p, the Lp-norm IDF gives better weights to visual words, thus making a significant improvement over the baseline approach.

Comparison with other weighting methods. Since the proposed Lp-norm IDF focuses on the TF-IDF weighting scheme, we compare it with other TF-IDF variants, e.g., Jégou et al. [14] and the Okapi-BM25 weighting [55]. The results are presented in Table II. Okapi-BM25 mainly deals with the TF part; the default parameters are used. In Table II, this method improves over the baseline, but is inferior to our method. Since Okapi-BM25 was initially proposed for text retrieval, where a query has only a few words, it may lose its power in image retrieval, where thousands of visual words exist in a query image. We also compare our method with [14]. We use the intra-image burstiness solution in our experiment and select the best-performing formula in [14]. From Table II, [14] produces a better result on the Paris 6K dataset, while on the others the proposed Lp-norm IDF is shown to be superior. The weighting scheme proposed in [14] mainly focuses on the TF factor (with a $\sqrt{(\cdot)}$ operator); the TF factor has a more direct impact on retrieval accuracy, because it deals with burstiness directly. Since our method is an IDF variant, we further combine it with the TF variant in [14]. The results in the last line of Table II indicate that the two methods are complementary to some extent and can further improve performance. As a consequence, the above results validate the feasibility of the Lp-norm IDF for large scale real-world applications.

Combination with post-processing steps. Post-processing techniques are effective means to improve retrieval accuracy [3], [47], [65]–[67]. In this paper, we add various post-processing methods to our system to test whether they benefit from the introduction of the Lp-norm IDF. Specifically, for the Oxford and Paris datasets, we add spatial verification (SP) using RANSAC and Query Expansion (QE), because query images in both datasets are architectures with rigid spatial configurations and the number of ground truth images is relatively large. Moreover, we try two different feature detectors, i.e., the Hessian-Affine detector (HesAff) and the DoG-Affine detector (DoGAff); according to [64], the DoGAff detector produces a higher baseline. Further, for the Holidays and Ukbench datasets, the graph fusion technique [65] is employed to fuse the rank results of BoW and a global HSV histogram; we do not apply QE on these two datasets because the number of ground truth images is quite small. The results are shown in Table III and Table IV.

We mainly discuss the case of the DoGAff detector; the situation for HesAff is similar. For the DoGAff detector, the performance of the BoW baseline can be noticeably improved by adding the Lp-norm IDF (+5.9% and +3.4% mAP for the two datasets, respectively). With RootSIFT, the Lp-norm IDF (pIDF) outperforms RootSIFT on both Oxford and Paris, by +4.0% and +2.4%, respectively. Moreover, when we add Multiple Assignment (MA) [43] and SP, we achieve 75.8% and 76.8% mAP on Oxford and Paris, respectively. In this scenario, pIDF further improves mAP by +3.6% (from 75.8% to 79.4%) and +1.6% (from 76.8% to 78.4%) for the two datasets. In the case of "RootSIFT + SQ + SP + QE", pIDF yields the best performance: 85.9% (+3.0%) on Oxford and 86.6% (+3.3%) on Paris. We notice that SQ produces slightly higher accuracy than MA, which can be attributed to the soft weighting scheme.

On the Holidays and Ukbench datasets, we implement the Graph Fusion (GF) technique [65] to fuse the global HSV histogram and BoW rank lists. Specifically, we extract a 1000-D HSV histogram for each image, normalize it by its L2-norm, and scale each dimension by a $\sqrt{(\cdot)}$ operator.


Using the HSV histogram alone, we obtain mAP = 65.4% and N-S = 3.402 on the two datasets, respectively, both higher than reported in [65]. In Table IV, the combination with GF improves performance significantly. On Holidays, the method "RootSIFT + SQ + GF" yields an mAP of 85.1% and 83.2% with and without pIDF, respectively; on Ukbench, the N-S score is 3.81 and 3.78, correspondingly. It is worth pointing out that the Lp-norm IDF brings more improvement for Query Expansion than for Graph Fusion. In fact, QE involves re-querying the database using top-ranked images, a process that again uses pIDF, whereas GF fuses BoW with the HSV histogram, and the latter does not use pIDF. As a consequence, pIDF works better with QE.

TABLE III
RESULTS ON THE OXFORD AND PARIS DATASETS (mAP) COMBINING VARIOUS METHODS

Method                      Oxford, HesAff   Oxford, DoGAff   Paris, HesAff    Paris, DoGAff
                            IDF     pIDF     IDF     pIDF     IDF     pIDF     IDF     pIDF
BoW                         0.685   0.696    0.659   0.718    0.663   0.682    0.715   0.749
BoW + SP                    0.706   0.721    0.694   0.739    0.680   0.699    0.739   0.763
BoW + MA + SP               0.740   0.752    0.712   0.754    0.694   0.709    0.746   0.771
BoW + MA + SP + QE          0.813   0.839    0.799   0.844    0.767   0.789    0.827   0.860
RootSIFT                    0.692   0.714    0.704   0.744    0.687   0.698    0.735   0.759
RootSIFT + SP               0.728   0.750    0.744   0.774    0.706   0.724    0.755   0.776
RootSIFT + MA + SP          0.745   0.759    0.758   0.794    0.725   0.737    0.768   0.784
RootSIFT + MA + SP + QE     0.807   0.830    0.814   0.852    0.779   0.795    0.832   0.864
RootSIFT + SQ + SP + QE     0.818   0.836    0.829   0.859    0.781   0.796    0.833   0.866

TABLE IV
ACCURACY ON HOLIDAYS (mAP) AND UKBENCH (N-S) COMBINING VARIOUS METHODS

Method               Holidays          Ukbench
                     IDF     pIDF      IDF     pIDF
BoW                  0.604   0.642     3.393   3.454
BoW + MA             0.598   0.650     3.452   3.508
BoW + MA + GF        0.818   0.843     3.715   3.741
RootSIFT             0.606   0.654     3.457   3.509
RootSIFT + MA        0.615   0.663     3.534   3.576
RootSIFT + MA + GF   0.832   0.850     3.776   3.803
RootSIFT + SQ + GF   0.832   0.851     3.781   3.810

Fig. 9. mAP for (a) Oxford 5K, (b) Paris 6K, (c) Holidays, and N-S score for (d) Ukbench, scaled with the Flickr 1M dataset as distractor images. Five methods are compared, i.e., the method without TF-IDF weighting, the TF-IDF baseline, and the proposed TF-avgIDF, TF-maxIDF, and TF-pIDF. For each database scale, the IDF scores are re-computed. It is clear that the Lp-norm IDF outperforms the other four methods, especially on larger databases.

Scalability. To evaluate the scalability of the proposed method, we populate the Oxford 5K, Paris 6K, Holidays, and Ukbench datasets with various fractions of the Flickr 1M dataset. Experimental results are presented in Fig. 9 and Table II. Four major conclusions can be drawn.

First, we examine the impact of the conventional IDF. It is obvious from Fig. 9 that the introduction of the conventional TF-IDF improves performance over the "no TF-IDF" case. After all, the TF-IDF scheme takes into account both the inter- and intra-image visual word distributions, providing some discriminative power. However, since the conventional TF-IDF is quite heuristic, the improvement is marginal: for the four datasets, gains of about +4.4%, +2.5%, +2.2% in mAP and 0.06 in N-S are observed on the 1M-scale databases, respectively.

Second, it is notable that as the database is scaled up, the mAP of our proposed method drops much more slowly; that is, more significant improvement is obtained on larger databases. Notably, the Lp-norm IDF outperforms the TF-IDF baseline by +10.3%, +10.9%, +13.0% in mAP, and +0.22 in N-S on the Oxford, Paris, Holidays, and Ukbench + 1M datasets, respectively. These gains are measured in absolute value. The results validate the scalability of the proposed method.

Third, we observe that the Lp-norm IDF generally works well on all four datasets. In essence, Oxford 5K and Paris 6K mainly contain images of buildings, while Ukbench and Holidays consist of general objects and scenes. As a consequence, we can conclude that the Lp-norm IDF has a wide application scope and is thus beneficial on general databases.

Fourth, we check the impact of whether the codebook is "homology" or "non-homology" [19]. We note that for the Oxford and Paris datasets, although the codebook is trained on the Oxford dataset (homology on Oxford), more notable improvement is observed on the Paris dataset (non-homology on Paris).


TABLE V
EFFICIENCY COMPARISON: AVERAGE RETRIEVAL TIME¹ (ms)

Methods             Oxford + 1M   Paris + 1M   Holidays + 1M   Ukbench + 1M
TF-IDF              84.9          87.0         89.9            80.1
TF-pIDF             85.0          87.8         90.2            80.0
BM25 [55]           87.5          92.1         94.6            81.1
Jégou et al. [14]   89.5          92.8         97.9            81.9

¹ Average retrieval time does not include feature extraction and quantization. We run our experiments with a C implementation on a sixteen-core Intel Xeon server (2.40 GHz) with 32 GB of memory. On the Flickr 1M dataset, quantization of an image takes 0.42 s on average.

Fourth, we examine the impact of whether the codebook is "homology" or "non-homology" [19]. We note that for the Oxford and Paris datasets, although the codebook is trained on the Oxford dataset (homology for Oxford), the more notable improvement is observed on the Paris dataset (non-homology for Paris). The codebook trained on Oxford fits its feature distribution well and is therefore quite discriminative for Oxford, but it is much more ambiguous [25] for Paris. As a result, the burstiness problem is more severe on Paris; since our method down-weights visual words in bursts and thus alleviates burstiness, the larger improvement is obtained on Paris. Similarly, the codebook for the Holidays dataset is also trained on distractor images, and the improvement reaches 13.0% in absolute value. This indicates that the Lp-norm IDF generalizes well to the case where the codebook is trained on irrelevant data (non-homology), where the improvement is considerably larger.

In addition, in a dynamic system (where new images are continually added and some existing images are removed), the Lp-norm IDF can be calculated in a "batch" manner, i.e., the IDF scores are updated after a fixed time interval. Since this computation runs in the background, it introduces little computational cost. In Fig. 9, the IDF scores are in fact re-computed for each database size, which can be viewed as simulating such a dynamic system; our method therefore remains effective in the dynamic case.

Efficiency. Efficiency is an important issue in many computer vision tasks [73], [74]. Table V compares the average retrieval time¹ of the different approaches. Note that sorting the database image scores accounts for about 65 ms of the average retrieval time. Since the Lp-norm IDF proposed in this paper is an offline approach, it only marginally increases the offline training time and has no influence on online retrieval; the Lp-norm IDF therefore shares the same time and memory efficiency as the baseline. The average retrieval time is 85.0 ms, 87.8 ms, 90.2 ms, and 80.0 ms on the Oxford, Paris, Holidays, and Ukbench + 1M datasets, respectively. Among the weighting schemes, Okapi-BM25 is the least efficient, because it computes the TF-IDF weights online. As a result, compared with the baseline, the BM25 weighting, and [14], our method better meets the user's expectation of a fast response time while achieving much higher retrieval accuracy.
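As noted above, in a dynamic database the pIDF scores can be refreshed offline in batches rather than after every insertion. The sketch below illustrates one way such a batch update could be organised: per-word accumulators (pooled term-frequency statistics) are updated as each batch of new images is indexed, and the IDF scores are recomputed from the accumulators in a single cheap pass. The accumulator layout, the simplified pIDF formula, and all names are our assumptions for illustration; they are not the authors' implementation.

import math
from collections import defaultdict

class BatchPIDF:
    """Offline, batch-style maintenance of (simplified) Lp-norm IDF scores."""

    def __init__(self, p=3.0):
        self.p = p
        self.n_images = 0
        self.tf_pow_sum = defaultdict(float)   # sum over images of tf(w)^p
        self.weights = {}                      # word id -> current pIDF score

    def index_batch(self, batch):
        """batch: iterable of dicts mapping visual-word id -> term frequency."""
        for image_tf in batch:
            self.n_images += 1
            for word, tf in image_tf.items():
                self.tf_pow_sum[word] += tf ** self.p

    def refresh(self):
        """Recompute all pIDF scores; run in the background after each batch."""
        for word, s in self.tf_pow_sum.items():
            pooled = s ** (1.0 / self.p)                 # Lp pooling of term frequencies
            self.weights[word] = math.log(self.n_images / pooled)
        return self.weights

# Usage: two batches of images arrive; the scores are refreshed once per batch.
pidf = BatchPIDF(p=3.0)
pidf.index_batch([{0: 1, 2: 5}, {0: 1, 1: 1}])
print(pidf.refresh())
pidf.index_batch([{1: 2, 2: 1}, {0: 1, 2: 1}])
print(pidf.refresh())

Because refresh() runs offline and touches only per-word accumulators, it does not affect query latency, matching the behaviour described above; deletions could be handled analogously by subtracting the corresponding terms.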

Furthermore, computing the Lp-norm IDF takes 6.80 s on the 1M database, while the conventional IDF takes 0.031 s. Considering that the Lp-norm IDF is calculated offline, this computation time (6.80 s) is nearly negligible. For large-scale web image retrieval applications, as the database grows we envision updating the Lp-norm IDF in such a batch processing manner, i.e., re-computing it each time a predefined batch of new images has been indexed. Consequently, the proposed Lp-norm IDF is time-efficient both online and offline, and can be readily adopted in real-world applications.

Comparison with the state of the art. We compare our results with state-of-the-art methods on the four datasets in Table VI. First, our method achieves competitive results on the Oxford and Paris datasets. Specifically, compared with [9], our system does not apply Discriminative Query Expansion (DQE) or spatial database-side feature augmentation (SPAUG), which are shown to yield consistent improvements on Oxford and Paris. Moreover, a different feature detector is used in [9], which produces a higher baseline. We also notice that [69] reports higher mAP on Oxford and Paris. In [69], kNN re-ranking is employed on top of the retrieval system; they report mAP of 75.2% and 74.1% on the two datasets before re-ranking, while our corresponding results are 77.4% (+2.2%) and 77.6% (+3.5%). Also note that we report higher results on the Ukbench and Holidays datasets than [69]. Second, on the Holidays and Ukbench datasets, we achieve state-of-the-art results after applying GF as a post-processing step: mAP of 85.1% on Holidays and N-S of 3.81 on Ukbench. Because the illumination changes on these two datasets are less severe than on Oxford and Paris, the fusion with the global color feature works well. We provide four groups of sample retrieval results, one for each dataset, in Fig. 10; they show that the proposed Lp-norm IDF can greatly enhance retrieval accuracy.

VI. CONCLUSION

This paper proposed an efficient yet effective IDF weighting scheme, the Lp-norm IDF. Through Lp-norm pooling, we integrate term frequency, document frequency, document length, and codebook information into the final IDF representation. The Lp-norm IDF operates at the visual word level and deals with the burstiness problem by down-weighting visual words in bursts. The parameter p is tuned by minimizing a cost function.


TABLE VI
COMPARISON WITH STATE-OF-THE-ART METHODS ON HOLIDAYS AND UKBENCH DATASETS

Methods   Oxford, mAP(%)   Paris, mAP(%)   Ukbench, N-S score   Holidays, mAP(%)
Ours      85.9             86.6            3.81                 85.1
[68]      84.9             82.4            -                    75.8
[65]      -                -               3.77                 84.6
[51]      -                -               3.56                 78.0
[9]       92.9             91.0            -                    -
[69]      88.4             91.1            3.52                 76.2
[70]      68.7             -               3.60                 80.9
[46]      -                -               3.50                 78.9
[28]      74.7             -               3.42                 81.3
[71]      85.0             85.5            -                    82.1
[72]      81.4             80.3            3.67                 -
[14]      68.5             -               3.64                 84.8

[Fig. 10 panel labels: Query (Oxford), Query (Paris), Query (Holidays), Query (Ukbench).]

Fig. 10. Sample retrieval results on (top) Oxford, (middle) Paris, (bottom left) Holidays, and (bottom right) Ukbench, using a codebook of 1M visual words. The query image is shown to the left of the dashed line, followed by the top-ranked images (the query itself is discarded). For each query, the baseline results are presented in the first row and the Lp-norm IDF results in the second row. Green/red boxes indicate correct/incorrect results, respectively. For the four samples, mAP increases from 14.9% to 63.7% (+48.8%) on Oxford, from 50.0% to 58.3% (+8.3%) on Paris, and from 12.5% to 100.0% (+87.5%) on Holidays, and the N-S score increases from 2 to 4 on Ukbench.

Extensive experiments on several benchmark datasets show that the proposed Lp-norm IDF has five advantages. First, the Lp-norm IDF brings improvement at various codebook sizes, which is of vital importance because codebook size strongly affects accuracy while no effective method has been proposed for selecting it optimally. Second, in large-scale experiments, our method is shown to be increasingly beneficial as the database is scaled up, which suggests the potential merit of the Lp-norm IDF in industrial use. Third, the Lp-norm IDF outperforms several TF-IDF weighting approaches, including the widely used Okapi-BM25 method; when combined with post-processing steps, it produces results competitive with the state-of-the-art methods. Fourth, the Lp-norm IDF achieves more prominent improvements when the codebook is trained on irrelevant data, which demonstrates its generalization power. Finally, since the Lp-norm IDF is an offline approach, it enjoys the same memory usage and computational efficiency as the baseline model.

In the future, more investigation will focus on empirical studies of the visual word frequency distribution and its discriminative power. We envision that a more uniformly distributed frequency profile may enhance the overall discriminative power of the codebook. This study re-emphasizes the importance of visual word weighting, and various weighting strategies will be studied.

REFERENCES

[1] W. Lu, J. Wang, X.-S. Hua, S. Wang, and S. Li, “Contextual image search,” in ACM Multimedia, 2011, pp. 513–522.
[2] X. Li, Y.-J. Zhang, B. Shen, and B.-D. Liu, “Image tag completion by low-rank factorization with dual reconstruction structure preserved,” in Image Processing (ICIP), 2014 IEEE International Conference on, 2014.
[3] J. Philbin, O. Chum, M. Isard, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in CVPR, 2007, pp. 1–8.
[4] D. Nistér and H. Stewénius, “Scalable recognition with a vocabulary tree,” in CVPR, 2006, pp. 2161–2168.
[5] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual categorization with bags of keypoints,” in Workshop on Statistical Learning in Computer Vision, ECCV, vol. 1, 2004, p. 22.
[6] J. Sivic and A. Zisserman, “Video Google: a text retrieval approach to object matching in videos,” in ICCV, 2003, pp. 1470–1477.
[7] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[8] K. Simonyan, A. Vedaldi, and A. Zisserman, “Descriptor learning using convex optimisation,” in ECCV, 2012, pp. 243–256.
[9] R. Arandjelovic and A. Zisserman, “Three things everyone should know to improve object retrieval,” in CVPR, 2012, pp. 2911–2918.
[10] R. Baeza-Yates, B. Ribeiro-Neto et al., Modern Information Retrieval. ACM Press, New York, 1999, vol. 463.
[11] B.-k. Wang, Y.-f. Huang, W.-x. Yang, and X. Li, “Short text classification based on strong feature thesaurus,” Journal of Zhejiang University SCIENCE C, vol. 13, no. 9, pp. 649–659, 2012.

[12] Z. Liu, H. Li, L. Zhang, W. Zhou, and Q. Tian, “Cross-indexing of binary sift codes for large-scale image search,” Image Processing, IEEE Transactions on, vol. 23, no. 5, pp. 2047–2057, 2014.
[13] K. S. Jones, “A statistical interpretation of term specificity and its application in retrieval,” Journal of Documentation, vol. 28, no. 1, pp. 11–21, 1972.
[14] H. Jégou, M. Douze, and C. Schmid, “On the burstiness of visual elements,” in CVPR, 2009, pp. 1169–1176.
[15] L. Zheng, S. Wang, Z. Liu, and Q. Tian, “Lp-norm idf for large scale image search,” in CVPR, 2013, pp. 1626–1633.
[16] T. Tuytelaars and C. Schmid, “Vector quantizing feature space with a regular lattice,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007, pp. 1–8.
[17] R. Ji, X. Xie, H. Yao, and W.-Y. Ma, “Vocabulary hierarchy optimization for effective and transferable retrieval,” in CVPR, 2009, pp. 1161–1168.
[18] J. Wang, J. Wang, Q. Ke, G. Zeng, and S. Li, “Fast approximate k-means via cluster closures,” in CVPR, 2012, pp. 3037–3044.
[19] Y. Xie, S. Jiang, and Q. Huang, “Weighted visual vocabulary to balance the descriptive ability on general dataset,” Neurocomputing, vol. 119, pp. 478–488, 2013.
[20] R. Ji, H. Yao, W. Liu, X. Sun, and Q. Tian, “Task-dependent visual-codebook compression,” Image Processing, IEEE Transactions on, vol. 21, no. 4, pp. 2282–2293, 2012.
[21] J. Cai, Z.-J. Zha, H. Luan, S. Zhang, and Q. Tian, “Learning attribute-aware dictionary for image classification and search,” in ACM ICMR, 2013, pp. 33–40.
[22] R. López-Sastre, T. Tuytelaars, F. Acevedo-Rodríguez, and S. Maldonado-Bascón, “Towards a more discriminative and semantic visual vocabulary,” Computer Vision and Image Understanding, vol. 115, no. 3, pp. 415–425, 2011.
[23] R. López-Sastre, J. Renes-Olalla, S. Maldonado-Bascón, P. Gil-Jiménez, and S. Lafuente-Arroyo, “Heterogeneous visual codebook integration via consensus clustering for visual categorization,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 23, no. 8, pp. 1358–1368, 2013.
[24] L. Zheng, S. Wang, W. Zhou, and Q. Tian, “Bayes merging of multiple vocabularies for scalable image retrieval,” arXiv:1403.0284v2, 2014.
[25] J. C. van Gemert, C. J. Veenman, A. W. Smeulders, and J.-M. Geusebroek, “Visual word ambiguity,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 32, no. 7, pp. 1271–1283, 2010.
[26] J. C. van Gemert, J.-M. Geusebroek, C. J. Veenman, and A. W. Smeulders, “Kernel codebooks for scene categorization,” in ECCV. Springer, 2008, pp. 696–709.
[27] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in CVPR, 2008, pp. 1–8.
[28] H. Jégou, M. Douze, and C. Schmid, “Improving bag-of-features for large scale image search,” International Journal of Computer Vision, vol. 87, no. 3, pp. 316–336, 2010.
[29] Y.-G. Jiang, C.-W. Ngo, and J. Yang, “Towards optimal bag-of-features for object categorization and semantic video retrieval,” in CIVR. ACM, 2007, pp. 494–501.
[30] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in CVPR, 2009, pp. 1794–1801.
[31] Y. Cai, W. Tong, L. Yang, and A. Hauptmann, “Constrained keypoint quantization: towards better bag-of-words model for large-scale multimedia retrieval,” in ACM ICMR, 2012, p. 16.
[32] Y. Li, S. Wang, Q. Tian, and X. Ding, “Learning cascaded shared-boost classifiers for part-based object detection,” Image Processing, IEEE Transactions on, vol. 23, no. 4, pp. 1858–1871, 2014.
[33] Z. Liu, H. Li, W. Zhou, R. Zhao, and Q. Tian, “Contextual hashing for large-scale image search,” Image Processing, IEEE Transactions on, vol. 23, no. 4, pp. 1606–1614, 2014.
[34] T. Wang, S. Wang, and X. Ding, “Detecting human action as spatiotemporal tube of maximum mutual information,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 24, no. 2, pp. 277–290, 2013.
[35] F. He, Y. Li, S. Wang, and X. Ding, “A novel hierarchical framework for human head-shoulder detection,” in Image and Signal Processing (CISP), 2011 4th International Congress on, vol. 3, 2011, pp. 1485–1489.
[36] L. Zheng and S. Wang, “Visual phraselet: Refining spatial constraints for large scale image search,” Signal Processing Letters, IEEE, vol. 20, no. 4, pp. 391–394, 2013.


[37] W. Zhou, H. Li, Y. Lu, and Q. Tian, “Sift match verification by geometric coding for large-scale partial-duplicate web image search,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), vol. 9, no. 1, p. 4, 2013.
[38] W. Zhou, H. Li, Y. Lu, and Q. Tian, “Principal visual word discovery for automatic license plate detection,” Image Processing, IEEE Transactions on, vol. 21, no. 9, pp. 4269–4279, 2012.
[39] S. Zhang, Q. Tian, G. Hua, Q. Huang, and S. Li, “Descriptive visual words and visual phrases for image applications,” in ACM MM, 2009, pp. 75–84.
[40] S. Zhang, Q. Huang, G. Hua, S. Jiang, W. Gao, and Q. Tian, “Building contextual visual vocabulary for large-scale image applications,” in ACM MM, 2010, pp. 501–510.
[41] L. Xie, Q. Tian, and B. Zhang, “Spatial pooling of heterogeneous features for image classification,” Image Processing, IEEE Transactions on, vol. 23, no. 5, pp. 1994–2008, 2014.
[42] L. Zhang, Y. Zhang, J. Tang, X. Gu, J. Li, and Q. Tian, “Topology preserving hashing for similarity search,” in ACM Multimedia, 2013, pp. 123–132.
[43] H. Jégou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” in ECCV, 2008, pp. 304–317.
[44] W. Zhou, M. Yang, H. Li, X. Wang, Y. Lin, and Q. Tian, “Towards codebook-free: Scalable cascaded hashing for mobile image search,” Image Processing, IEEE Transactions on, vol. 16, no. 3, pp. 601–611, 2014.
[45] L. Zheng, S. Wang, Z. Liu, and Q. Tian, “Packing and padding: Coupled multi-index for accurate image retrieval,” arXiv:1402.2681v2, 2014.
[46] C. Wengert, M. Douze, and H. Jégou, “Bag-of-colors for improved image search,” in ACM Multimedia, 2011, pp. 1437–1440.
[47] Z. Liu, S. Wang, L. Zheng, and Q. Tian, “Visual reranking with improved image graph,” in ICASSP, 2014, pp. 6909–6913.
[48] B. Su, X. Ding, L. Peng, and C. Liu, “A novel baseline-independent feature set for Arabic handwriting recognition,” in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, 2013, pp. 1250–1254.
[49] S. Bu and Y.-J. Zhang, “Image retrieval with hierarchical matching pursuit,” in Image Processing (ICIP), 2014 IEEE International Conference on, 2014.
[50] L. Zheng, S. Wang, F. He, and Q. Tian, “Seeing the big picture: Deep embedding with contextual evidences,” arXiv:1406.0132, 2014.
[51] X. Wang, M. Yang, T. Cour, S. Zhu, K. Yu, and T. X. Han, “Contextual weighting for vocabulary tree based image retrieval,” in ICCV, 2011, pp. 209–216.
[52] Y. Zhang, Z. Jia, and T. Chen, “Image retrieval with geometry-preserving visual phrases,” in CVPR, 2011, pp. 809–816.
[53] Y. Su and F. Jurie, “Improving image classification using semantic attributes,” International Journal of Computer Vision, vol. 100, no. 1, pp. 59–77, 2012.
[54] Y.-G. Jiang, J. Wang, X. Xue, and S.-F. Chang, “Query-adaptive image search with hash codes,” IEEE Transactions on Multimedia, vol. 15, no. 2, 2013.
[55] S. E. Robertson and S. Walker, “Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval,” in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994, pp. 232–241.
[56] A. Singhal, “Modern information retrieval: A brief overview,” IEEE Data Eng. Bull., vol. 24, no. 4, pp. 35–43, 2001.
[57] F. Perronnin, J. Sánchez, and T. Mensink, “Improving the Fisher kernel for large-scale image classification,” in ECCV, 2010, pp. 143–156.
[58] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier, “Large-scale image retrieval with compressed Fisher vectors,” in CVPR, 2010, pp. 3384–3391.
[59] C. Elkan, “Deriving tf-idf as a Fisher kernel,” in String Processing and Information Retrieval, 2005, pp. 295–300.
[60] J. Feng, B. Ni, Q. Tian, and S. Yan, “Geometric Lp-norm feature pooling for image classification,” in CVPR, 2011, pp. 2609–2704.
[61] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: spatial pyramid matching for recognizing natural scene categories,” in CVPR, 2006, pp. 2169–2178.
[62] C.-Z. Zhu, H. Jégou, S. Satoh et al., “Query-adaptive asymmetrical dissimilarities for visual object retrieval,” in ICCV, 2013.
[63] M. Perďoch, O. Chum, and J. Matas, “Efficient representation of local geometry for large scale object retrieval,” in CVPR, 2009, pp. 9–16.
[64] K. Simonyan, A. Vedaldi, and A. Zisserman, “Learning local feature descriptors using convex optimisation,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2014.

[65] S. Zhang, M. Yang, T. Cour, K. Yu, and D. N. Metaxas, “Query specific fusion for image retrieval,” in ECCV 2012. Springer, 2012, pp. 660–673.
[66] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman, “Total recall: automatic query expansion with a generative feature model for object retrieval,” in ICCV, 2007, pp. 1–8.
[67] L. Xie, Q. Tian, W. Zhou, and B. Zhang, “Fast and accurate near-duplicate image search with affinity propagation on the imageweb,” Computer Vision and Image Understanding, 2014.
[68] A. Mikulík, M. Perďoch, O. Chum, and J. Matas, “Learning a fine vocabulary,” in ECCV 2010. Springer, 2010, pp. 1–14.
[69] X. Shen, Z. Lin, J. Brandt, S. Avidan, and Y. Wu, “Object retrieval and localization with spatially-constrained similarity measure and k-nn re-ranking,” in CVPR, 2012, pp. 3013–3020.
[70] S. Zhang, M. Yang, X. Wang, Y. Lin, and Q. Tian, “Semantic-aware co-indexing for near-duplicate image retrieval,” in ICCV, 2013.
[71] D. Qin, C. Wengert, and L. Van Gool, “Query adaptive similarity for large scale object retrieval,” in CVPR, 2013, pp. 1610–1617.
[72] D. Qin, S. Gammeter, L. Bossard, T. Quack, and L. Van Gool, “Hello neighbor: accurate object retrieval with k-reciprocal nearest neighbors,” in CVPR, 2011, pp. 777–784.
[73] X. Yang, X. Gao, D. Tao, and X. Li, “Improving level set method for fast auroral oval segmentation,” Image Processing, IEEE Transactions on, vol. 23, no. 7, pp. 2854–2865, 2014.
[74] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik, “Fast, accurate detection of 100,000 object classes on a single machine,” in CVPR, 2013, pp. 1814–1821.

Liang Zheng received the B.E. degree in Life Science from Tsinghua University, China, in 2010. He is currently a Ph.D. candidate in Information and Communication Engineering in the Department of Electronic Engineering, Tsinghua University, China. His research interests include multimedia information retrieval and computer vision.

Shengjin Wang (M’04) received the B.E. degree from Tsinghua University, China, and the Ph.D. degree from the Tokyo Institute of Technology, Tokyo, Japan, in 1985 and 1997, respectively. From 1997 to 2003, he was a member of the research staff with the Internet System Research Laboratories, NEC Corporation, Japan. Since 2003, he has been a Professor with the Department of Electronic Engineering, Tsinghua University, where he is currently the Director of the Research Institute of Image and Graphics. His current research interests include image processing, computer vision, video surveillance, and pattern recognition.

Qi Tian (M’96-SM’03) received the B.E. degree in electronic engineering from Tsinghua University, China, the M.S. degree in electrical and computer engineering from Drexel University, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign, in 1992, 1996, and 2002, respectively. He is currently a Professor with the Department of Computer Science, University of Texas at San Antonio (UTSA). He took a one-year faculty leave at Microsoft Research Asia from 2008 to 2009. His research interests include multimedia information retrieval and computer vision.
