

Learning Object-to-Class Kernels for Scene Classification

Lei Zhang, Member, IEEE, Xiantong Zhen, and Ling Shao, Senior Member, IEEE

Abstract— High-level image representations have drawn increasing attention in visual recognition, e.g., scene classification, since the invention of the Object Bank. The Object Bank represents an image as a response map of a large number of pre-trained object detectors and has achieved superior performance for visual recognition. In this paper, based on the Object Bank representation, we propose the object-to-class (O2C) distances to model scene images. In particular, four variants of the O2C distances are presented. With the O2C distances, we can map the Object Bank representation into lower-dimensional but more discriminative spaces, called distance spaces, which are spanned by the O2C distances. Due to the explicit computation of the O2C distances based on the Object Bank, the obtained representations possess more semantic meaning. To combine the discriminative ability of the O2C distances to all scene classes, we further propose to kernelize the distance representation for the final classification. We have conducted extensive experiments on four benchmark data sets, UIUC-Sports, Scene-15, MIT Indoor, and Caltech-101, which demonstrate that the proposed approaches can significantly improve the original Object Bank approach and achieve state-of-the-art performance.

Index Terms— Object bank, scene classification, object-to-class distances, object filters, kernels.

I. INTRODUCTION

Scene classification aims to categorize scene images into a discrete set of semantic classes according to the content of the images. It is crucial to image browsing, retrieval, understanding and so on. For browsing and retrieval tasks, scene classification helps narrow the search space dramatically, while for image understanding, extensive knowledge from different scene categories can provide much more information beyond the images themselves.

How much information can we acquire from an image? For human beings, at the first glance of an image, one can get

Manuscript received December 17, 2013; revised April 4, 2014 and May 27, 2014; accepted May 30, 2014. Date of publication June 4, 2014; date of current version June 23, 2014. This work was supported in part by the Support Plan of Young Teachers of Heilongjiang Province and in part by Harbin Engineering University, Harbin, China, under Grant 1155G17. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Vladimir Stankovic. (Corresponding Author: Ling Shao.) L. Zhang is with the College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China (e-mail: [email protected]). X. Zhen is with the Department of Medical Biophysics, University of Western Ontario, London, ON N6A 3K7, Canada (e-mail: [email protected]). L. Shao is with the Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield S1 3JD, U.K. (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2014.2328894

Fig. 1. Relations among object detection, scene classification and image understanding.

a rich amount of semantically meaningful information and can even describe the content of the image in sentences or tell the story behind the picture. It was shown in [1] that, even for video sequences in real-world circumstances, brief glimpses may be sufficient to confirm or deny that a scene is the expected one.

Image understanding, scene classification and object detection are closely related topics; Fig. 1 gives a rough sketch of the relationship among them. Image understanding sits at the highest level compared to retrieval or classification, and the understood content may go beyond the image itself. As with speech and text understanding, image understanding is far more than recognizing the objects and their locations in the image; its aim is to uncover the information hidden in the image by analyzing the complex relations among its different parts. Yao et al. [2] provide a holistic scene understanding approach that simultaneously reasons about regions, locations, classes and spatial extent of objects, as well as scene types. Object detection, in turn, plays a basic role for both image classification and understanding: for rigid or non-rigid objects alike, detection aims at high accuracy under varying conditions, including luminance changes and view-angle variations. Sitting at the middle level, scene classification can, on the one hand, exploit the achievements of object detection and extract semantic knowledge in the form of scene class labels by analyzing the relations between objects; on the other hand, it can provide additional knowledge, extended by similar scenes, for image understanding. In this paper, we focus on scene classification based on object detection.

Scene classification has long been regarded as a challenging task due to the difficulties caused by huge intra-class variations and inter-class ambiguities. Take the scene of a birthday party for instance. The most representative object is the birthday cake, together with the



Fig. 2. Illustration of intra-class variability (a) and inter-class ambiguity (b). (a) Two images from the same scene class 'polo'. (b) Four images from the different scene classes living room, kitchen, office and bedroom (top left to bottom right) that are similar in appearance.

balloons and cheering people. In addition to the different shapes of birthday cakes or other objects, a long-distance shot is quite different from a close-distance shot. The same disparity can be observed in Fig. 2 (a) for a polo scene: the left sub-figure is a close-distance shot while the right one is a long-distance shot, and the two look significantly different from each other. In addition, similar objects in similar locations can be shared by scenes from different categories, as illustrated in Fig. 2 (b): the television in the living room is extremely similar to the computer in the office, while the cabinet in the kitchen looks very much like the wardrobe in the bedroom.

Over the past years, many methods, from feature extraction to refined classifier models, have been proposed for scene classification. With respect to feature extraction, global features such as texture statistics and color histograms are extracted from the whole image. These features are regarded as low-level representations, since they are based on pixels and cannot capture the semantic differences among scenes. Typically, low-level features are adopted for simple scene classification such as indoor/outdoor scenes [3]; for example, in [4], simple line drawings are proposed to capture the key information of scenes. As for local features, most methods are built on the bag-of-words (BoW) model and the sparse coding (SC) algorithm. Since these representations characterize the statistical relations among local descriptors, they are considered mid-level features [5]. Although mid-level features can, to some extent, capture the distributional information of local feature descriptors, there is still a gap between the mid-level representation and the semantic meaning. In fact, each scene is


composed of several objects organized in an unpredictable layout, and objects play different roles in different scenes. For example, a bed is more significant than a window for a bedroom scene, while a table is crucial to an office scene. From this point of view, the Object Bank [6] approach, which decomposes each scene into a high-dimensional space in which each dimension indicates an object response, is a reasonable choice. It can fill the semantic gap by imitating the understanding process of human beings at the first glance of an image. The Object Bank approach represents an image by the responses of pre-trained object filters, which can be regarded as a high-level representation of the image. Due to the explicit detection of objects in images, the Object Bank provides an effective avenue to understanding scene images. However, in the original Object Bank approach, the object filters have no prior knowledge of the distributions of objects in images, and the responses of the object filters are treated equally in the final image representation. This tends to be less discriminative for classification, leading to suboptimal representations.

Inspired by the image-to-class (I2C) distance in NBNN, in this paper we propose the Object-to-Class (O2C) distances based on the Object Bank for high-level scene representation. In practice, we provide four different versions of the O2C distances. Furthermore, we propose to kernelize the representations based on the O2C distances, which proves highly effective. Our method enjoys the advantages of the Object Bank representation while significantly improving its performance. The benefits of using the O2C distances are two-fold: 1) the O2C distances, different from I2C, are employed to build a subspace representation for each object, which is more discriminative; 2) O2C provides an intuitive and effective avenue to link the distance space to the classification, which improves the classification performance.

The contributions of this work can be summarized in the following aspects: 1) we propose the Object-to-Class (O2C) distances for scene classification, and specifically, four variants of the O2C distances are provided; 2) based on the O2C distances, a kernelization framework is built to map the Object Bank representation into a new distance space with more discriminative ability; 3) our method combines the benefits of the high-level representation of the Object Bank with kernel methods.

The remainder of this paper is organized as follows. We briefly review the related work in Section II. The Object Bank representation is summarized in Section III. The object-to-class (O2C) distances are introduced in Section IV. The basic O2C distance and its variants are presented in Section V. The kernelization of the O2C distance representation is given in Section VI. Finally, we show experiments and results in Section VII and conclude our work in Section VIII.

II. RELATED WORK

Representations based on local features have dominated the field of image/scene classification. Interest points are first extracted from the images by detectors [7], [8] and then described by local descriptors such as SIFT [9] and


DAISY [10]. The key issue is then to measure the similarity between two images represented as sets of local descriptors, which is non-trivial due to the different cardinality and orderlessness of local features from different images. Over the past decades, approaches based on the bag-of-words (BoW) model and the sparse coding algorithm have achieved impressive results in many challenging tasks, and the representations from both BoW and SC can be regarded as mid-level features [5]. The main deficits of the BoW model and SC are the quantization errors and the loss of structural information of local features, and many efforts have been made to alleviate the quantization errors caused by hard assignment and to compensate for the loss of the location information of local descriptors.

To deal with the quantization errors, soft assignment was proposed to encode local descriptors by multiple visual words [11]. It applies techniques from kernel density estimation to allow a degree of ambiguity in assigning local feature descriptors to codewords; by using kernel density estimation, the uncertainty between codewords and image features is lifted beyond the vocabulary and becomes part of the codebook model. The sparse coding algorithm circumvents this problem by relaxing the restrictive cardinality constraint of vector quantization (VQ) [12]. Furthermore, a locality constraint has been imposed on the encoding [13], [14], which has proven to be crucial for classification. The pyramid match kernel (PMK) [15] approximates the optimal partial matching by computing a weighted intersection over multi-resolution histograms, which implicitly finds correspondences based on the finest-resolution histogram cell where a matched pair first appears. Bo et al. [16], [17] showed that the BoW representation can be formulated in terms of match kernels over image patches. This view combines BoW and kernelization to some extent, and is helpful for designing a family of kernel descriptors that provide a unified and principled framework to turn pixel attributes into compact patch-level features.

However, all the above methods suffer from the inability to encode sufficient structure of local descriptors, which is important for modeling scene images. To cope with the loss of structure, spatial pyramid matching (SPM) [18], a special case of the pyramid match kernel (PMK) [15], was proposed to incorporate location information into the BoW model: an image is partitioned into increasingly fine sub-regions and a histogram of local features is formed inside each sub-region. Beyond spatial pyramid matching, spatial layout also plays an important role in real-world scene understanding. In [19], the spatial envelope of an environment is modeled by a composite set of boundaries, such as walls, sections and ground, to relate the outlines of the surfaces to their properties, including the inner texture pattern.

In addition, many topic models have also been explored for image modeling and classification, ranging from probabilistic latent semantic analysis (pLSA) [20] and latent Dirichlet allocation (LDA) [21] to the hybrid generative and discriminative approach [22]. Bosch et al. [20] learned categories and their distributions in unlabelled training images


by pLSA. Fei-Fei and Perona [21] represented each region as part of a theme, while the theme distributions as well as the codeword distributions over the themes were learned without supervision by modifying the LDA approach. Although both LDA and pLSA were originally proposed for text/document analysis, they have also demonstrated remarkable performance in scene classification; the advantage of these two models is that they add hidden variables to enhance the representation ability of the model. In the hybrid generative and discriminative approach of [22], the latent topics are first discovered by pLSA at the generative learning stage, and a multi-class SVM classifier is then trained on the topic distribution vectors at the discriminative learning stage.

With the development of object recognition [15], [23], [24], a promising direction is to analyze images with high-level image features. A well-known algorithm is the Object Bank representation [6], in which images are decomposed into different components associated with object filters. In [6], it is shown that low-level features cannot handle ambiguities among different scenes, whereas the Object Bank, which represents the image on a collection of object-sensing filters, can deal with this kind of ambiguity elegantly. Even if the object detectors cannot exactly determine all objects, the responses to the different detectors can still capture much of the variation of the corresponding objects, encoding abundant prior information about the visual space. Additionally, the Object Bank treats the object labels as attributes to help the final scene classification. Recently, attempts have been made to improve the performance of the Object Bank [25], [26]. In [25], an optimal object bank (OOB) was proposed by imposing weights on the detectors according to their discriminative abilities, while in [26], Zhang et al. proposed to project the high-level features of the Object Bank representation into discriminative subspaces, obtained by clustering the features in a supervised way, to attain a more compact and discriminative representation.

Another interesting direction in scene recognition is to tackle the most difficult task, i.e., the overlap problem. In this scenario, the class labels are not mutually exclusive, meaning that some images belong to multiple classes. To address this task, Boutell et al. [27] proposed a cross-training algorithm, which was later extended to multi-instance multi-label learning [28].

III. OBJECT BANK REPRESENTATION

The core idea of the Object Bank representation is to decompose an image according to a pre-defined object filter bank. Specifically, when an object filter traverses all pixels in an image, only the maximum of the responses is kept to represent how likely the corresponding object is to occur in this image. By concatenating the max response of each filter, we generate a representation in which each dimension corresponds to one object filter with a certain configuration (scale, location and profile). From a semantic point of view, the obtained representation can give details about the



TABLE I SOME OBJECTS USED IN THE 177 OBJECT FILTERS

content of the image in terms of how likely each object in the object bank is to appear in this image. We revisit the Object Bank approach from two aspects, the object filter generation and the Object Bank representation, in the following subsections.

A. Object Filters

The essence of the Object Bank representation can be considered a mapping procedure that projects images into a new space spanned by objects. In the new space, the dimensions correspond to the responses of the object filters (note that an object has multiple filters with different scales and profiles), and an image is a single point whose coordinate along each dimension shows the likelihood of the image containing the corresponding object. In this representation, how to select the object list, or object bank, and how to build the filters are important questions. Theoretically, the number of objects during decomposition could be infinite, while only a small proportion of objects appear in images with high frequency. This is analogous to Zipf's law [29] in natural language processing, which implies that only a small proportion of words account for the majority of documents. In [6], 177 of the most frequent objects are selected according to the frequency of occurrence of objects in different datasets. Table I shows some of the objects used in the 177 object filters.

With regard to the object filters, different kinds of objects can be handled by distinct approaches. For instance, the latent SVM object detectors [30] are suitable for most blobby objects (tables, cars, humans, etc.), while a texture classifier [31] is better for texture- and material-based objects (sky, road, sand, etc.). Fortunately, the availability of large-scale image datasets, e.g., LabelMe [32] and ImageNet [33], makes it possible to obtain object detectors for a wide range of visual concepts. Fig. 3 shows some examples of object filters. For objects with simple contours such as wheel and fork, the corresponding filters give quite distinctive profiles, but for complex objects like monkey and dog, the difference between the filter templates alone is not significant enough to distinguish them. These ambiguities produce additional difficulty during classification. In principle, by selecting a proper object list, the main content of an image can be sufficiently represented

Fig. 3. The profiles of object filters. (a) Fork (left) and Wheel (right). (b) Monkey (left) and Dog (right).

by the responses of the object filters with semantic concepts.

B. Object Bank

This subsection describes how to reflect the probability of each object occurring in an image and how to deal with the probabilities from all object filters. To answer the first question, the normalized max response of the filter can be treated as the corresponding probability. Given an image $G$ and a filter $F$ in the object bank, the response of the filter at the point $(x, y)$ in the image is the sum of the products of the filter coefficients and the corresponding neighborhood points in the area spanned by the filter mask, which can be formulated as:

$$\sum_{(x', y') \in \text{neighborhood of } (x, y)} F(x', y') \cdot G(x + x', y + y') \qquad (1)$$

By moving the center point $(x, y)$ over all pixels in the image, we obtain the filter responses at every pixel. Since each filter reflects the outline of the object to some extent, the sum in Eq. (1) essentially measures the similarity between the object template and the patch around the pixel. After normalization, the maximum response can be viewed as the probability of the object occurring in the image.
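To make Eq. (1) concrete, the following is a minimal NumPy/SciPy sketch: the object filter is slid over the image as a cross-correlation and only the maximum response is kept as the (unnormalized) evidence for the object. The function name and the toy sizes are illustrative, not part of the original Object Bank implementation.

```python
import numpy as np
from scipy.signal import correlate2d

def max_filter_response(image, obj_filter):
    """Slide the object filter over the image as in Eq. (1) (a cross-correlation)
    and keep the maximum response as the evidence for the object."""
    # 'valid' keeps only positions where the filter fully overlaps the image.
    response_map = correlate2d(image, obj_filter, mode="valid")
    return response_map.max()

# Toy example: a random grayscale image and a random 16x16 filter template.
rng = np.random.default_rng(0)
print(max_filter_response(rng.random((128, 128)), rng.random((16, 16))))
```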


As in [6], if each object is described by filters with two profiles (front and side), 6 scales and a 3-level spatial pyramid (with 1 + 4 + 16 sub-regions), there will be 177 (objects) × 2 (profiles) × 6 (scales) × 21 (sub-regions) = 44,604 maximum responses for each image. Among the maximum responses from different objects, the largest ones indicate the objects that are most probably contained in the image. Some of the objects used in the 177 object filters are listed in Table I.

Instead of only using the most probable object response [34], we concatenate all the maximum responses from the object filters to encode richer semantic information about the content of the image. An image is then a point in the high-dimensional space spanned by the object filters, which is similar to the idea of the bag-of-words model where the space is spanned by the visual words in the vocabulary. Instead of using the word frequency in a document as the feature vector, here the max response of each object is adopted in the final Object Bank representation. It can be interpreted as how likely this object is to appear in the corresponding image, analogous to the meaning of the word frequency in document analysis.

IV. OBJECT-TO-CLASS DISTANCES

Inspired by the success of the Image-to-Class (I2C) distance in NBNN [35] as well as its extensions such as the NBNN kernel [36] and local NBNN [37], in this paper we propose the Object-to-Class (O2C) distances for high-level object representation, which generalize the Object Bank representation. Before introducing our object-to-class distances, we revisit the image-to-class distance and compare it with the image-to-image distance. With regard to the Object Bank representation, we provide a different view based on subspace decomposition, from which the object-to-class distance is introduced.

A. Image-to-Class Distance

Models based on local features have achieved state-of-the-art results in many visual object recognition tasks. An image can be described by a collection of local feature descriptors, e.g., SIFT, extracted from patches around salient interest points or on regular grids. How to measure the similarity between two images represented as sets of local features is thus a basic problem in both image classification and matching. The problem is non-trivial because the cardinality of the set varies across images and the elements are unordered [16].

The bag-of-words model can be deemed the most widely used algorithm for image representation based on local features. Local feature descriptors of an image are projected into a new representation space spanned by the visual words in a vocabulary, which yields a histogram as a fixed-length vector to represent the image. In this case, the image-to-image (I2I) distance is typically computed to generate kernels that feed the support vector machine (SVM) for classification. Since the histogram actually expresses the distribution of local descriptors over the dictionary, the I2I distance in fact measures the distance between distributions.


However, if we measure the distance between the distribution of local features from one image and that from a set of images in one class, the distance becomes the image-to-class (I2C) distance [35] used in the naive Bayes nearest neighbor (NBNN) classifier. Owing to the I2C distances, which effectively handle the large intra-class variations, the NBNN classifier avoids the quantization errors of the BoW model and has proven successful in image classification. The I2C distance from an image to a candidate class is formulated as the sum of the Euclidean distances from the local feature descriptors in this image to their nearest neighbor descriptors in the descriptor set of the candidate class. This similarity measure operates directly on each image represented as a set of local descriptors. It resembles a huge ensemble of very weak classifiers associated with local feature descriptors, which can exploit the discriminative power of both highly and lowly informative descriptors [35].

B. Subspace Perspective of Object Bank

In the original Object Bank approach, all maximum responses from different object filters, scales, profiles and pyramid levels are concatenated into a single long vector to represent the content of the image, as shown in Fig. 4. If we use $I$ to denote this vector, as analyzed above, the representation lies in $\mathbb{R}^{44604}$. One disadvantage of this representation is that it is less discriminative because all the object filters are treated equally. As shown in Fig. 2(b), the bed is irreplaceably crucial for the bedroom scene. Since different objects play distinct roles in the same scene class, and even the same object can affect different scene classes differently, it is sensible to consider the objects separately and distinctively.

In fact, the Object Bank approach assumes that objects are independent of each other. Therefore, the responses from the object filters can be treated separately in the representation. The 44,604-dimensional space is then divided into 177 subspaces, each corresponding to one object with $2 \times 6 \times 21 = 252$ dimensions. Fig. 4 shows the difference between our subspace decomposition and the traditional Object Bank approach. Once we decompose the Object Bank representation into the object subspaces, we can treat different objects separately, and we can divide the image representation into parts that correspond to the object subspaces.

C. Object-to-Class (O2C) Distances

In this subsection, we introduce our newly proposed object-to-class (O2C) distances, which are inspired by the success of the image-to-class (I2C) distance in the NBNN classifier. Given an image $Q$ represented as a set of subspace vectors $I_1, \ldots, I_n, \ldots, I_N$, where $I_n \in \mathbb{R}^D$ is the subspace associated with the $n$-th object and $D = 252$ is the dimensionality of each subspace, we can find the class of the image $Q$ by the Maximum-A-Posteriori (MAP) classifier.
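The subspace view described above can be sketched as a simple reshaping of the concatenated Object Bank vector, assuming the 252 responses of each object filter (2 profiles × 6 scales × 21 pyramid cells) are stored contiguously; that layout assumption and the helper name are illustrative.

```python
import numpy as np

N_OBJECTS = 177
DIMS_PER_OBJECT = 2 * 6 * 21        # profiles x scales x pyramid cells = 252

def to_subspaces(ob_vector):
    """Split a 44,604-dimensional Object Bank vector into 177 object subspaces
    I_1, ..., I_177, assuming each object's 252 responses are contiguous."""
    ob_vector = np.asarray(ob_vector)
    assert ob_vector.size == N_OBJECTS * DIMS_PER_OBJECT   # 44,604
    return ob_vector.reshape(N_OBJECTS, DIMS_PER_OBJECT)   # row n is the subspace vector I_n

# Toy example with a random Object Bank vector.
print(to_subspaces(np.random.rand(44604)).shape)            # (177, 252)
```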



Fig. 4. Difference between Object Bank and our subspace treatment.

Similarly to the NBNN classifier, under the assumption that the class prior $p(c)$ is uniform, the MAP classifier reduces to the Maximum-Likelihood (ML) classifier:

$$\hat{c} = \arg\max_c p(c|Q) = \arg\max_c p(Q|c). \qquad (2)$$

Under the Naive-Bayes assumption that $I_1, \ldots, I_n, \ldots, I_N$ are i.i.d. given the class $c$, we have:

$$p(Q|c) = p(I_1, \ldots, I_N|c) = \prod_{n=1}^{N} p(I_n|c), \qquad (3)$$

where $p(I_n|c)$ can be approximated using non-parametric Parzen density estimation. Taking the log probability of the ML decision rule, we rewrite Eq. (3) as:

$$\hat{c} = \arg\max_c \log p(Q|c) = \arg\max_c \sum_{n=1}^{N} \log p(I_n|c). \qquad (4)$$

For the estimation of $p(I_n|c)$, denoted as $\hat{p}(I_n|c)$, by properly selecting the kernel function in the Parzen likelihood estimation, we can turn the probability into a distance representation as follows:

$$\hat{p}(I_n|c) = \frac{1}{L} \sum_{j=1}^{L} K\big(I_n - I_c^n(j)\big), \qquad (5)$$

where $L$ is the total number of the $n$-th object's responses from class $c$ and $I_c^n(j)$ corresponds to the $j$-th response of object $n$ in class $c$. $K(\cdot)$ is a kernel function, which determines the range of covered samples. Theoretically, as $L$ goes to infinity, the estimated distribution $\hat{p}(I_n|c)$ converges to the true one $p(I_n|c)$.

Fig. 5. Illustration of probability density of responses from the object filters: (a) Carpet and (b) Truck.

Considering the long-tail characteristic of the object response distribution shown in Fig. 5, most of the terms in the summation of Eq. (5) are negligible, so we keep only the $r$ nearest neighbors of object $n$. Then Eq. (5) becomes:

$$\hat{p}_{NN}(I_n|c) = \frac{1}{L} \sum_{j=1}^{r} K\big(I_n - NN_c^j(I_n)\big), \qquad (6)$$

where $NN_c^j(I_n)$ denotes the $j$-th nearest neighbor of object $n$ of the query image in class $c$. In the simplest case we set $r = 1$ as in NBNN, where $NN_c^j(I_n)$ can be written as $NN_c(I_n)$. $K(\cdot)$ in Eqs. (5) and (6) denotes the kernel function, which is typically chosen as a Gaussian kernel as in Eq. (7). In practice, the Gaussian assumption is reasonable. We show two examples in Fig. 5, from which we can see that the maximum responses


from different object filters obey Gaussian distributions.

$$K\big(I_n - NN_c(I_n)\big) = \exp\left(-\frac{1}{2\sigma}\,\|I_n - NN_c(I_n)\|^2\right) \qquad (7)$$


Algorithm 1 The Basic O2C Distance $d_{NN}(I_n, c)$

Furthermore, assuming that the kernel bandwidths in the Parzen function are the same for all the classes, the log function and the exp function can be merged as:

$$\hat{c} = \arg\max_c \sum_{n=1}^{N} \log p(I_n|c) = \arg\max_c \sum_{n=1}^{N} \left(-\frac{1}{2\sigma}\,\|I_n - NN_c(I_n)\|^2\right) = \arg\min_c \sum_{n=1}^{N} \|I_n - NN_c(I_n)\|^2. \qquad (8)$$

We define

$$d_{NN}(I_n, c) = \|I_n - NN_c(I_n)\|^2 \qquad (9)$$

as the Object-to-Class (O2C) distance from the object $I_n$ to the class $c$.
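The decision rule of Eqs. (8) and (9) can be sketched as follows: for each class, sum the nearest-neighbor O2C distances over all object subspaces and predict the class with the smallest sum. The data layout (a per-class array of training subspace vectors) and the function names are assumptions made for illustration.

```python
import numpy as np

def o2c_nn_distance(I_n, class_samples):
    """Basic O2C distance d_NN(I_n, c) of Eq. (9): squared Euclidean distance
    from the subspace vector I_n to its nearest neighbor in class c."""
    diffs = class_samples - I_n                  # class_samples: (L, D)
    return np.min(np.sum(diffs ** 2, axis=1))

def classify(I, class_subspaces):
    """Eq. (8): predict the class with the smallest sum of O2C distances over
    all object subspaces; class_subspaces[c][n] holds the (L_c, D) training
    vectors of class c in subspace n."""
    scores = {c: sum(o2c_nn_distance(I[n], samples[n]) for n in range(len(I)))
              for c, samples in class_subspaces.items()}
    return min(scores, key=scores.get)

# Toy usage: 3 classes, 5 training images each, 177 subspaces of 252 dimensions.
rng = np.random.default_rng(0)
train = {c: rng.random((177, 5, 252)) for c in range(3)}
query = rng.random((177, 252))
print(classify(query, train))
```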

It is associated with subspace $n$, which means the distance may differ across objects. The definition of the O2C distance is inspired by the I2C distance, yet the two are essentially different. We summarize the differences between O2C and I2C in the following three aspects:

1) The I2C distance is defined from an image to a class as the sum of the distances from all the local features of the image, e.g., SIFT, to their corresponding nearest local features in the class. The basic O2C distance is defined from an object (represented as the responses of a series of object filters, which may or may not appear in an image) to its nearest object in a class. In a nutshell, the I2C distance is a sum of many distances between local features, while the O2C distance is a distance between two objects (responses of two object filters). In addition, we also provide several variants of the O2C distance based on the basic definition, which makes O2C differ even further from the I2C distance.

2) In contrast to the I2C distance, the O2C distance has explicit semantic meaning, because the response of an object filter to an image indicates the probability of this object appearing in the image, whereas local features such as SIFT are just low-level features without semantic meaning.

3) An important advantage of the O2C distance over the I2C distance is that O2C is defined on a bank of ordered object filters (while local features in I2C distances are orderless). This makes the O2C distance more flexible in the construction of kernels.

V. BASIC O2C DISTANCE AND ITS VARIANTS

In this section, we propose several variants of the O2C distance, which reflect the different effects of an object on scene classes.

Fig. 6. Sketch map for the effect of anchors.

A. The Basic O2C Distance

As defined in Eq. (8), we adopt the distance from $I_n$ to its nearest neighbor in class $c$ as the O2C distance $d_{NN}(I_n, c)$, which we take as the baseline O2C distance. Suppose $Q_n$ is the training set in the subspace of object $n$ and $Q_n^c$ represents the samples from class $c$. The procedure to compute the O2C distance $d_{NN}(I_n, c)$ is given in Algorithm 1.

In the basic O2C distance, the candidate competitor is simply the nearest neighbor. This is suitable for scenarios where, for any object $I_n$ of a test sample, the nearest neighbor belongs to the right class. However, when there are noisy samples, this assumption is violated, leading to inferior performance.

B. O2C Distance With Anchors

One of the most important advantages of NBNN is the avoidance of vector quantization, which, however, brings two shortcomings. On the one hand, due to the nearest neighbor search, the computational cost can be extremely high, especially when there are a huge number of high-dimensional training samples. On the other hand, as mentioned above, there always exist noisy samples in practice, which make the basic O2C distance less discriminative and unreliable. Inspired by the previous work in [25] and [26], we propose the O2C distance with anchors, $d_A(I_n, c)$, where the anchor points can be cluster centers obtained by the k-means clustering algorithm.

The difference between the O2C distance with anchors $d_A(I_n, c)$ and the basic $d_{NN}(I_n, c)$ is illustrated in Fig. 6, from which it can be seen that the anchor points possess better generalization properties than the nearest neighbor points. The red star represents the test sample; the length of a dashed line denotes the NN distance, while the length of a solid line denotes the distance to an anchor point. For the points in the region of intersection between two classes, anchors help correct mistakes in the distance computation.
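A minimal sketch of the anchor variant $d_A(I_n, c)$ follows, assuming scikit-learn's k-means is used for clustering and, as in the later experiments, 10 anchors per class; the paper only specifies k-means, so the concrete library call and helper names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_anchors(class_samples, n_anchors=10, seed=0):
    """Cluster the training samples of one class in one object subspace with
    k-means and keep the cluster centers as anchor points."""
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=seed)
    return km.fit(class_samples).cluster_centers_     # class_samples: (L, D) -> (n_anchors, D)

def o2c_anchor_distance(I_n, anchors):
    """O2C distance with anchors d_A(I_n, c): squared Euclidean distance from
    I_n to the nearest anchor of class c."""
    diffs = anchors - I_n
    return np.min(np.sum(diffs ** 2, axis=1))

# Toy usage for one subspace and one class.
rng = np.random.default_rng(0)
anchors = build_anchors(rng.random((200, 252)), n_anchors=10)
print(o2c_anchor_distance(rng.random(252), anchors))
```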



Algorithm 2 O2C Distance With Anchors $d_A(I_n, c)$

Algorithm 3 O2C Distance With Locality $d_L(I_n, c)$

Accordingly, to compute the O2C distance with anchors $d_A(I_n, c)$ in Algorithm 2, the objects in each class are first clustered by the k-means algorithm, and then the minimum distance between $I_n$ and the cluster centers is taken as the final distance from object $n$ to class $c$.

C. O2C Distance With Locality

The idea of locality has been extensively exploited in many machine learning algorithms, including manifold learning [38], sparse coding [13] and NBNN [37]. Inspired by its success in local NBNN, we incorporate locality into our O2C distance. Supposing that the classes are not equally important for object $I_n$, instead of finding the nearest neighbor from each class, we only focus on the classes in the neighborhood of $I_n$. We search the k nearest neighbors over the whole training set containing all classes, and only the classes that have samples in this local neighborhood are considered to contribute significantly. The other classes can be treated as background classes, since they are less relevant to $I_n$. In order to assign a value to those background classes, during the k-nearest-neighbor search we extend the neighborhood by one additional sample, the $(k+1)$-th nearest neighbor of $I_n$. More specifically, among all classes, only those with at least one sample falling in this local neighborhood are important for the discriminative representation, while for the remaining classes the distances are fixed to the $(k+1)$-th nearest-neighbor distance. This is analogous to drawing a hypersphere around $I_n$ that is allowed to contain only $k$ samples according to their distances to $I_n$. Classes with samples falling in the hypersphere are marked for particular focus, while the rest, without samples in the hypersphere, are treated equally. The calculation is summarized in Algorithm 3, where only the classes with samples among the k nearest neighbors have candidate competitors, namely the nearest of those points to the test sample.
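A sketch of $d_L(I_n, c)$ following the description above (the listing of Algorithm 3 is not reproduced here): classes with a sample among the k nearest neighbors of $I_n$ receive the distance to their closest such neighbor, while all remaining background classes receive the $(k+1)$-th nearest-neighbor distance. The default $k = 40$ mirrors the best Scene-15 setting reported later; the function name and data layout are illustrative.

```python
import numpy as np

def o2c_locality_distances(I_n, samples, labels, classes, k=40):
    """O2C distance with locality d_L(I_n, c) for every class c: classes with a
    sample among the k nearest neighbors of I_n get the distance to their
    closest such neighbor; the remaining background classes all get the
    (k+1)-th nearest-neighbor distance. Assumes k < len(samples)."""
    d2 = np.sum((samples - I_n) ** 2, axis=1)     # squared distances to all training samples
    order = np.argsort(d2)
    dists = {c: d2[order[k]] for c in classes}    # default: the (k+1)-th NN distance
    for idx in order[:k][::-1]:                   # farthest to nearest, so the nearest wins
        dists[labels[idx]] = d2[idx]
    return dists

# Toy usage: 8 classes, 400 training samples in one object subspace of 252 dims.
rng = np.random.default_rng(0)
samples, labels = rng.random((400, 252)), rng.integers(0, 8, size=400)
print(o2c_locality_distances(rng.random(252), samples, labels, classes=range(8)))
```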

D. O2C Distance With Class Locality

We now look more closely into the neighborhood used by the locality-based distance $d_L$. In Fig. 7, we plot statistics over different object subspaces and different neighborhood sizes for each sample of the UIUC-Sports training set with 8 classes. For the same object subspace ($n = 1$, $k = 5$ in the blue line and $n = 1$, $k = 45$ in the green line), the number of classes included in the neighborhood grows as the neighborhood is enlarged; yet even at $k = 45$, for some samples the neighborhood still cannot cover all classes. Comparing the brown line with the green line shows that different object subspaces possess different neighborhoods covering different classes. Table II lists the maximum distances between some samples and their k nearest neighbors for object subspace 1, together with the $(k+1)$-th neighbor distance. For the second sample, even the distance beyond the k nearest neighbors is smaller than the distances within the k nearest neighbors of other samples.

These findings motivate us to incorporate class label information into the locality. Instead of fixing the number of samples in the neighborhood, we fix the number of classes in it and search for candidate competitors accordingly. This compensates for the shortcoming of Algorithm 3 that we do not know in advance how many classes will participate in computing the O2C distance. Additionally, in order to verify whether the background classes are helpful for classification, we propose a new O2C distance $d_{CL}$, namely the O2C distance with class locality. It enlarges the neighbor set to ensure that a specific number of classes appears in the neighborhood; that is, the size of the neighborhood varies so as to keep a fixed number $N_C$ of classes involved in the computation of the O2C distance. The calculation of $d_{CL}$ is described in Algorithm 4.

VI. O2C KERNELS

Having defined the O2C distances, we can predict a test sample by choosing the class with the smallest sum of O2C distances over the object subspaces. In this scenario, the only concern is whether the distance of the test sample to the class it belongs to is the shortest one, while the distances of the test sample to the other classes are ignored.



Fig. 7. Statistics of the numbers of classes in the neighborhood in Algorithm 3, where k is the number of the nearest neighbors and n indexes the object subspace.

TABLE II COMPARISON BETWEEN MAX DISTANCE AMONG THE k NEAREST NEIGHBORS AND THE (k + 1)-TH DISTANCE

Algorithm 4 O2C Distance With Class Locality $d_{CL}(I_n, c)$

In fact, the distances to all scene categories contain much discriminative information for classification. In order to effectively utilize this information, we propose to kernelize the O2C distances, namely, to build discriminative kernels on the O2C distances.

A. Match Kernels

Given two sets of features $X = \{I_1^{(x)}, \ldots, I_n^{(x)}, \ldots, I_N^{(x)}\}$ and $Y = \{I_1^{(y)}, \ldots, I_n^{(y)}, \ldots, I_N^{(y)}\}$, which could be the Object Bank representations, the normalized sum match kernel [39] is selected as it satisfies the Mercer condition in Eq. (10):

$$K(X, Y) = \sum_{c \in \mathcal{C}} K^c(X, Y) = \frac{1}{|X||Y|} \sum_{c \in \mathcal{C}} \sum_m \sum_n k^c\big(I_m^{(x)}, I_n^{(y)}\big), \qquad (10)$$

where $\mathcal{C} = \{1, \ldots, c, \ldots, C\}$ is the set of all classes. Furthermore, in order to combine the O2C distances into the kernel function $K(X, Y)$, the local kernel $k^c(I_n^{(x)}, I_n^{(y)})$ is defined as:

$$k^c\big(I_n^{(x)}, I_n^{(y)}\big) = \phi^c\big(I_n^{(x)}\big)^T \phi^c\big(I_n^{(y)}\big) = f^c\big(d(I_n^{(x)}, 1), \ldots, d(I_n^{(x)}, C)\big)^T f^c\big(d(I_n^{(y)}, 1), \ldots, d(I_n^{(y)}, C)\big), \qquad (11)$$

where $d(I_n^{(x)}, c)$ denotes the O2C distance from the object $I_n^{(x)}$ to the class $c$. Note that the distance can be any one of the O2C distances above. For convenience, we denote the kernels built on the different distances as $K_{NN}$, $K_A$, $K_L$ and $K_{CL}$, corresponding to $d_{NN}$, $d_A$, $d_L$ and $d_{CL}$, respectively. Through the kernel function, $I_n^{(x)}$ and $I_n^{(y)}$ are not compared directly; instead, their distances to the classes are contrasted with each other. Even if the two features $I_n^{(x)}$ and $I_n^{(y)}$ are far apart in the original feature space, they are considered close if they have similar distances to each class. In practice, the function $f^c(d(I_n^{(x)}, 1), \ldots, d(I_n^{(x)}, C))$ is rewritten as:

$$f^c\big(d(I_n^{(x)}, 1), \ldots, d(I_n^{(x)}, C)\big) = d(I_n^{(x)}, c). \qquad (12)$$
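Under the reading that Eq. (12) turns each local kernel into a product of O2C distances, the match kernel of Eqs. (10)-(11) reduces to a (normalized) inner product between per-image O2C distance maps of size 177 × C, which matches the $N_C \times 177$ mapped dimensionality mentioned in Section VII-A. The sketch below follows that reading and feeds the resulting precomputed kernel to an SVM; the per-subspace matching, the normalization and the use of scikit-learn's SVC (rather than the linear SVM of [43]) are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def o2c_kernel_matrix(D_train, D_test=None):
    """O2C kernel sketch for Eqs. (10)-(12): with f^c(.) = d(I_n, c), the local
    kernels accumulate into an inner product between per-image O2C distance
    maps (n_subspaces x n_classes), normalized by the set sizes."""
    X = np.asarray([d.ravel() for d in D_train])             # (n_train, 177*C)
    Y = X if D_test is None else np.asarray([d.ravel() for d in D_test])
    n_sub = D_train[0].shape[0]                               # |X| = |Y| = number of subspaces
    return Y @ X.T / (n_sub * n_sub)

# Toy usage: random O2C distance maps for 40 training and 5 test images,
# 177 object subspaces and 15 classes, classified with a precomputed-kernel SVM.
rng = np.random.default_rng(0)
D_train = [rng.random((177, 15)) for _ in range(40)]
y_train = rng.integers(0, 15, size=40)
clf = SVC(kernel="precomputed", C=1.0)
clf.fit(o2c_kernel_matrix(D_train), y_train)
D_test = [rng.random((177, 15)) for _ in range(5)]
print(clf.predict(o2c_kernel_matrix(D_train, D_test)))       # rows: test, cols: train
```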

VII. EXPERIMENTS

We conduct comprehensive experiments on the UIUC-Sports, Scene-15, MIT Indoor and Caltech-101 datasets, which range from generic natural scene images (Scene-15) to complex event and activity images (UIUC-Sports).

A. Datasets and Settings

UIUC-Sports [40] consists of 8 sports event categories, with the number of images per class ranging from 137 to 250. We follow the experimental settings in [6] by randomly selecting 70 images per class for training and 60 images for testing. Scene-15 [18] is a dataset of 15 natural scene classes, ranging from daily-life indoor scenes such as bedroom and kitchen to outdoor natural scenes such as mountain and coast.



We use 100 images in each class for training and the rest for testing. MIT Indoor [41] contains 15,620 images of 67 indoor scene categories. We follow the experimental setting in [41] by using 80 images from each class for training and 20 for testing. Caltech-101 [42] was collected using Google Image Search, and its images contain significant clutter, occlusions, and intra-class appearance variation. Each class has at least 31 images and there are 102 classes in total, including a background class. We use 30 images from each class for training and 20 images for testing.

In these approaches, the basic distance between two object subspace representations $I_n(i)$ and $I_n(j)$ can be chosen as either one of the following:

$$d_M\big(I_n(i), I_n(j)\big) = \sum_{d=1}^{D} \big|I_n(i)_d - I_n(j)_d\big| \qquad (13)$$

and

$$d_E\big(I_n(i), I_n(j)\big) = \sqrt{\sum_{d=1}^{D} \big(I_n(i)_d - I_n(j)_d\big)^2}, \qquad (14)$$

where $d_M$ and $d_E$ denote the Manhattan distance and the Euclidean distance, respectively. A linear SVM classifier [43] is employed for the final scene classification, with the one-against-the-rest strategy for the multi-class setting. The parameters, including $\gamma$ and $C$, are obtained by cross validation in our experiments.

In terms of computational complexity, we compare with the work in [16], which is representative of kernel methods. In our method, the computational complexity of the kernel matrix is $O(n^2 d)$, where $n$ is the number of training images and $d$ is the dimensionality. Furthermore, after the space mapping in our method, the dimensionality is reduced to $N_C \times 177$, where $N_C$ is the number of image classes. In [16], the complexity of the kernel matrix is $O(n^2 m^2 d)$, where $m$ is the average number of points in one image, typically around 2000. Therefore, our methods are more efficient.

B. Results on Scene-15

The performance of the kernels associated with the different O2C distances on the Scene-15 dataset is reported in Table III. The kernel $K_{CL}$ based on the proposed O2C distance $d_{CL}(I_n, c)$ achieves the best results and significantly outperforms the other kernels. Among the first three rows, $K_{NN}$ already performs well, with all results above 84.6% for both distances $d_M$ and $d_E$, which indicates that it is reasonable to adopt the nearest neighbor to represent the corresponding class. For $K_A$, where the number of cluster anchors is set to 10, the underlying idea is to exploit the generalization capability of vector quantization. From the results in Table III, we observe that the generalization does take effect on the Scene-15 dataset, but the improvement is limited. This could be due to the sparsity of training samples, which cannot guarantee that all anchors in a scene class represent the samples from this class well.

TABLE III PERFORMANCE COMPARISON ON THE SCENE-15 AND UIUC-SPORTS DATASETS WITH THE FOUR PROPOSED O2C DISTANCES

For $K_L$, only the classes found in the local neighborhood of the k nearest neighbors contribute. The performance here is compromised by not taking into consideration the label information of the samples in the local neighborhood, which tends to be less discriminative. However, its computational complexity is much lower than that of $K_{NN}$, especially for datasets with many scene categories and a huge number of training samples. In the last row of Table III, $K_{CL}$ on the O2C distance with class locality $d_{CL}(I_n, c)$ yields the best performance. In the computation of this distance, the number of classes in the local neighborhood is fixed and kept the same for all samples. When computing the distance to a certain class, the nearest neighbor is selected among all the neighbors covering the $N_C$ classes. Compared with $K_L$, the number of neighbors varies to ensure the same number of classes in the locality, which makes the computed O2C distances more discriminative.

We have also investigated the effects of the parameters in each algorithm. For the O2C distance with locality $d_L(I_n, c)$, the only parameter is the number $k$ of nearest neighbors. We have tested $k$ ranging from 3 to 45, and the results are shown in Fig. 8 (a). The performance keeps going up as $k$ increases and reaches its peak at $k = 40$ for both $d_M$ and $d_E$. After $k = 40$, the performance decreases due to the limited number of training samples: if the neighborhood determined by $k$ covers too many samples in the training set, the locality constraint tends to be less effective. With regard to the O2C distance with class locality $d_{CL}(I_n, c)$, the number $N_C$ of classes in the neighborhood is the most important parameter. The experimental results are shown in Fig. 8 (b). $N_C$ indicates how many classes should be included in the neighborhood of each sample; note that in this case the number of samples in the neighborhood is not fixed. For the Scene-15 dataset, the largest class number is 15, so we test $N_C$ ranging from 3 to 15. In this case, the trend of $d_M$ and $d_E$ is not consistent with that in Fig. 8 (a): apart from the rapid drop of $d_E$ at $N_C = 15$, the Manhattan distance $d_M$ performs better than the Euclidean distance $d_E$ in Fig. 8 (a), while the situation is the opposite in Fig. 8 (b).

C. Results on UIUC Sports

In contrast to the Scene-15 dataset, the UIUC Sports database is more complicated, with larger intra-class variations as shown in Fig. 2 (a). The same procedure as that conducted on



Fig. 8. The effects of the nearest-neighbor number $k$ in $d_L(I_n, c)$ and the class number $N_C$ in $d_{CL}(I_n, c)$ on Scene-15 (a and b), UIUC Sports (c and d), MIT Indoor (e and f), and Caltech-101 (g and h).

the Scene-15 dataset is performed to verify the effectiveness of the proposed methods. Fig. 8 (c) and Fig. 8 (d) illustrate the effects of the number $k$ of nearest neighbors in $K_L$ and of $N_C$ in $K_{CL}$, respectively. The Manhattan and Euclidean distances show similar performance in both Fig. 8 (c) and Fig. 8 (d). The best classification rates with the two distances are reached at $k = 25$ with 82.46% and 82.42%, respectively, and at $N_C = 3$ with 86.02% and 85.79%. Table III also summarizes the results on the UIUC-Sports dataset for the proposed methods. Among the four O2C distances, $K_{CL}$ consistently yields the best performance, similar to the behavior on the Scene-15 database. For the first three rows in Table III, the performance stays almost on the same level, for the same reasons as in the analysis on Scene-15, while for $K_{CL}$ the improvement is clearly more significant.

D. Results on MIT Indoor

Compared to the UIUC Sports and Scene-15 datasets, the MIT Indoor dataset contains more image categories. Fig. 8 (e) shows the performance with different numbers $k$ of nearest neighbors used in the computation of $d_L(I_n, c)$; $d_M$ and $d_E$ behave consistently, with the best results at 60 nearest neighbors. Fig. 8 (f) illustrates the performance of $d_{CL}(I_n, c)$ under different class numbers $N_C$. The results of the different O2C distances with $d_M$ and $d_E$ are summarized in Table IV, in which the best result, 39.90%, is achieved by $K_A$ with $d_M$. Different from the results on UIUC Sports and Scene-15, $K_L$ and $K_{CL}$ produce relatively low results, which likely stems from the larger number of image categories in this dataset. The same trend can be found on the following Caltech-101 dataset.

TABLE IV PERFORMANCE COMPARISON ON MIT INDOOR AND CALTECH-101 WITH THE FOUR PROPOSED O2C DISTANCES

E. Results on Caltech-101

The Caltech-101 dataset contains even more classes than the UIUC Sports, Scene-15 and MIT Indoor datasets, which poses more challenges for classification due to the larger variations of backgrounds. Similarly, the effects of the number $k$ of nearest neighbors on the performance of $d_L(I_n, c)$ and of the class number $N_C$ on the performance of $d_{CL}(I_n, c)$ for this dataset are shown in Fig. 8 (g) and (h), respectively. The best results over different $k$ and $N_C$ are 64.26% and 62.79% for $d_L(I_n, c)$ and $d_{CL}(I_n, c)$, respectively. The results of the different O2C distances with $d_M$ and $d_E$ are also reported in Table IV. On this dataset, $K_A$ again achieves the best result, which is consistent with that on the MIT Indoor dataset.

F. Comparisons With Other Methods

To show the contributions of the proposed methods for scene classification, we have also compared them with state-of-the-art algorithms on the four datasets. The results on Scene-15, UIUC-Sports, MIT Indoor and Caltech-101 are reported in Table V. The proposed methods significantly outperform the original Object Bank



TABLE V PERFORMANCE COMPARISON OF THE PROPOSED APPROACHES WITH THE STATE OF THE ART

approach on all four datasets, which demonstrates the effectiveness of our work. It is worth mentioning that for the original Object Bank with a linear SVM classifier, the accuracies on the Scene-15 and UIUC-Sports datasets are 82.03% and 77.50% respectively, which are lower than those of the methods proposed in this work. On the MIT Indoor and Caltech-101 datasets with larger class numbers, the side effects of the background class become significant, which degrades the system performance. However, the proposed $K_{NN}$ and $K_A$ approaches still improve over the original Object Bank approach by large margins.

Compared with state-of-the-art approaches, to the best of our knowledge, $K_{CL}$ outperforms most published results on the Scene-15 and UIUC-Sports datasets. For MIT Indoor, the best performance among the four proposed algorithms is 39.85%, which is also comparable with the performance of the state-of-the-art methods. The Caltech-101 dataset contains 101 image categories, which is comparable with the number (177) of object filters listed in Table I. Relative to the large number of image categories, the number of object filters is small and the representation ability therefore becomes weak, which is why the Object Bank approach, as well as the proposed methods based on it, cannot beat the state-of-the-art methods. However, creating more object filters would improve the performance of methods based on the Object Bank.

VIII. CONCLUSION

In this paper, based on the Object Bank representation, we have proposed the Object-to-Class distance for scene classification. Four variants of the O2C distances, namely the O2C distances with the nearest neighbor, anchors, locality and class locality, are designed for different scenarios. With the O2C distances, the representations obtained from the Object Bank are mapped into a more compact but discriminative space, called the distance space. The obtained distance representations carry sufficient semantic meaning due to the use of the Object Bank. To classify the scenes, we then further kernelize the distance representations to obtain the O2C kernels, which not only take advantage of the high-level representation of the Object Bank but also generalize the distance representation. To validate the proposed O2C kernels, we have performed comprehensive experiments on four widely used image datasets: Scene-15, UIUC-Sports, MIT Indoor and Caltech-101. We have also provided an extensive investigation of the parameters of our methods. The results on the four datasets consistently show that the proposed methods can significantly improve the original Object Bank approach and achieve comparable performance with state-of-the-art methods for image classification.

REFERENCES

[1] M. C. Potter, “Meaning in visual search,” Science, vol. 187, no. 4180, pp. 965–966, 1975. [2] J. Yao, S. Fidler, and R. Urtasun, “Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR, Jun. 2012, pp. 702–709. [3] M. Szummer and R. W. Picard, “Indoor-outdoor image classification,” in Proc. IEEE Int. Workshop Content-Based Access Image Video Database, Jan. 1998, pp. 42–51. [4] D. B. Walther, B. Chai, E. Caddigan, D. M. Beck, and L. Fei-Fei, “Simple line drawings suffice for functional MRI decoding of natural scene categories,” Proc. Nat. Acad. Sci., vol. 108, no. 23, pp. 9661–9666, 2011. [5] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce, “Learning mid-level features for recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR, Jun. 2010, pp. 2559–2566. [6] L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei, “Object bank: A high-level image representation for scene classification and semantic feature sparsification,” in Proc. Adv. Neural Inform. Process. Syst., vol. 24. 2010, pp. 1378–1386. [7] K. Mikolajczyk and C. Schmid, “Scale & affine invariant interest point detectors,” Int. J. Comput. Vis., vol. 60, no. 1, pp. 63–86, 2004. [8] K. Mikolajczyk et al., “A comparison of affine region detectors,” Int. J. Comput. Vis., vol. 65, nos. 1–2, pp. 43–72, 2005. [9] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004. [10] E. Tola, V. Lepetit, and P. Fua, “DAISY: An efficient dense descriptor applied to wide-baseline stereo,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 815–830, May 2010. [11] J. C. van Gemert, C. J. Veenman, A. W. M. Smeulders, and J.-M. Geusebroek, “Visual word ambiguity,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 7, pp. 1271–1283, Jun. 2010. [12] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in Proc. IEEE Conf. CVPR, Jun. 2009, pp. 1794–1801. [13] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Localityconstrained linear coding for image classification,” in Proc. IEEE Conf. CVPR, Jun. 2010, pp. 3360–3367. [14] L. Liu, L. Wang, and X. Liu, “In defense of soft-assignment coding,” in Proc. IEEE Int. Conf. Comput. Vis., ICCV, Nov. 2011, pp. 2486–2493. [15] K. Grauman and T. Darrell, “The pyramid match kernel: Discriminative classification with sets of image features,” in Proc. IEEE Int. Conf. Comput. Vis., ICCV, vol. 2. Oct. 2005, pp. 1458–1465. [16] L. Bo and C. Sminchisescu, “Efficient match kernel between sets of features for visual recognition,” in Proc. Adv. Neural Inform. Process. Syst., 2009, pp. 135–143. [17] L. Bo, X. Ren, and D. Fox, “Kernel descriptors for visual recognition,” in Proc. Adv. Neural Inform. Process. Syst., 2010, pp. 244–252. [18] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR, vol. 2. Jun. 2006, pp. 2169–2178. [19] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, 2001. [20] A. Bosch, A. Zisserman, and X. Munoz, “Scene classification via pLSA,” in Proc. Eur. Conf. Comput. Vis., ECCV, 2006, pp. 517–530.


[21] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR, vol. 2, Jun. 2005, pp. 524–531.
[22] A. Bosch, A. Zisserman, and X. Munoz, "Scene classification using a hybrid generative/discriminative approach," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 4, pp. 712–727, Apr. 2008.
[23] F. Zhu and L. Shao, "Weakly-supervised cross-domain dictionary learning for visual recognition," Int. J. Comput. Vis., vol. 109, nos. 1–2, pp. 42–59, Aug. 2014.
[24] L. Shao, L. Liu, and X. Li, "Feature learning for image classification via multiobjective genetic programming," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 7, pp. 1359–1371, Jul. 2014.
[25] L. Zhang, S. Xie, and X. Zhen, "Towards optimal object bank for scene classification," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP, May 2013, pp. 1967–1970.
[26] L. Zhang, S. Xie, L. Shao, and X. Zhen, "Discriminative high-level representations for scene classification," in Proc. IEEE ICIP, Sep. 2013, pp. 4345–4348.
[27] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, "Learning multi-label scene classification," Pattern Recognit., vol. 37, no. 9, pp. 1757–1771, 2004.
[28] Z.-H. Zhou and M.-L. Zhang, "Multi-instance multi-label learning with application to scene classification," in Proc. Adv. Neural Inform. Process. Syst., 2006, pp. 1609–1616.
[29] R. Edwards and L. Collins, "Lexical frequency profiles and Zipf's law," Lang. Learn., vol. 61, no. 1, pp. 1–30, 2011.
[30] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010.
[31] D. Hoiem, A. A. Efros, and M. Hebert, "Automatic photo pop-up," ACM Trans. Graph., vol. 24, no. 3, pp. 577–584, 2005.
[32] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: A database and web-based tool for image annotation," Int. J. Comput. Vis., vol. 77, nos. 1–3, pp. 157–173, 2008.
[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. CVPR, Jun. 2009, pp. 248–255.
[34] J. Vogel and B. Schiele, "Semantic modeling of natural scenes for content-based image retrieval," Int. J. Comput. Vis., vol. 72, no. 2, pp. 133–157, 2007.
[35] O. Boiman, E. Shechtman, and M. Irani, "In defense of nearest-neighbor based image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR, Jun. 2008, pp. 1–8.
[36] T. Tuytelaars, M. Fritz, K. Saenko, and T. Darrell, "The NBNN kernel," in Proc. IEEE Int. Conf. Comput. Vis., ICCV, Nov. 2011, pp. 1824–1831.
[37] S. McCann and D. G. Lowe, "Local naive Bayes nearest neighbor for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR, Jun. 2012, pp. 3650–3656.
[38] K. Yu, T. Zhang, and Y. Gong, "Nonlinear learning using local coordinate coding," in Proc. Adv. Neural Inform. Process. Syst., Dec. 2009, pp. 2223–2231.
[39] S. Lyu, "Mercer kernels for object recognition with local features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR, vol. 2, Jun. 2005, pp. 223–229.
[40] L.-J. Li and L. Fei-Fei, "What, where and who? Classifying events by scene and object recognition," in Proc. IEEE Int. Conf. Comput. Vis., ICCV, Oct. 2007, pp. 1–8.
[41] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR, Jun. 2009, pp. 413–420.
[42] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," Comput. Vis. Image Understand., vol. 106, no. 1, pp. 59–70, 2007.
[43] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, p. 27, 2011.
[44] A. Shabou and H. LeBorgne, "Locality-constrained and spatially regularized coding for scene categorization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR, Jun. 2012, pp. 3618–3625.
[45] Z. Niu, G. Hua, X. Gao, and Q. Tian, "Context aware topic model for scene recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR, Jun. 2012, pp. 2743–2750.
[46] S. Gao, I. W. H. Tsang, and L. T. Chia, "Kernel sparse representation for image classification and face recognition," in Proc. Eur. Conf. Comput. Vis., ECCV, 2010, pp. 1–14.


[47] M. Dixit, N. Rasiwasia, and N. Vasconcelos, "Adapted Gaussian models for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR, Jun. 2011, pp. 937–943.
[48] S. Gao, I. W. Tsang, L.-T. Chia, and P. Zhao, "Local features are not lonely—Laplacian sparse coding for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR, Jun. 2010, pp. 3555–3561.
[49] P. Wang, J. Wang, G. Zeng, W. Xu, H. Zha, and S. Li, "Supervised kernel descriptors for visual recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR, Jun. 2013, pp. 2858–2865.
[50] J. Wu and J. M. Rehg, "CENTRIST: A visual descriptor for scene categorization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1489–1501, Aug. 2011.

Lei Zhang (M’09) is a Professor of Computer Science with the College of Information and Communication Engineering, Harbin Engineering University, Harbin, China. Her research interests include signal/image processing, computer vision, and machine learning.

Xiantong Zhen received the B.S. and M.E. degrees from Lanzhou University, Lanzhou, China, in 2007 and 2010, respectively, and the Ph.D. degree from the Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield, U.K., in 2013. He is currently a Post-Doctoral Fellow with the University of Western Ontario, London, ON, Canada. His research interests include computer vision, machine learning, and medical image analysis.

Ling Shao (M’09–SM’10) received the B.Eng. degree in electronic and information engineering from the University of Science and Technology of China, Hefei, China, the M.Sc. degree in medical image analysis and the Ph.D. degree in computer vision from the Robotics Research Group, University of Oxford, Oxford, U.K. He is currently a Senior Lecturer (Associate Professor) with the Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield, U.K. Before joining the University of Sheffield, he was a Senior Scientist with Philips Research, Eindhoven, The Netherlands. His research interests include computer vision, image/video processing, pattern recognition, and machine learning. He has authored and co-authored over 120 academic papers in refereed journals and conference proceedings, and holds over 10 European/U.S. patents. He is an Associate Editor of the IEEE TRANSACTIONS ON CYBERNETICS, Information Sciences, Neurocomputing, and several other journals, and has edited several special issues for the journals of the IEEE, Elsevier, and Springer. He is a fellow of the British Computer Society.
