IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 8, AUGUST 2015

Learning a Probabilistic Topology Discovering Model for Scene Categorization

Luming Zhang, Member, IEEE, Rongrong Ji, Senior Member, IEEE, Yingjie Xia, Member, IEEE, Ying Zhang, and Xuelong Li, Fellow, IEEE

Abstract— A recent advance in scene categorization prefers topology-based modeling to capture the existence and relationships among different scene components. To that effect, local features are typically used to handle photographing variances such as occlusions and clutters. However, in many cases, the local features alone cannot well capture the scene semantics, since they are extracted from tiny regions (e.g., 4 × 4 patches) within an image. In this paper, we mine a discriminative topology and a low-redundant topology from the local descriptors under a probabilistic perspective, which are further integrated into a boosting framework for scene categorization. In particular, by decomposing a scene image into basic components, a graphlet model is used to describe their spatial interactions. Accordingly, scene categorization is formulated as an intergraphlet matching problem. The above procedure is further accelerated by introducing a probabilistic representative topology selection scheme that makes the pairwise graphlet comparison tractable despite their exponentially increasing volumes. The selected graphlets are highly discriminative and independent, characterizing the topological characteristics of scene images. A weak learner is subsequently trained for each topology, and the weak learners are boosted together to jointly describe the scene image. In our experiments, the visualized graphlets demonstrate that the mined topological patterns are representative of scene categories, and our proposed method beats state-of-the-art models on five popular scene data sets.

Index Terms— Boosting, discrimination, learning, probabilistic model, redundancy, topology.

I. INTRODUCTION

SCENE categorization is a key component in many computer vision tasks, e.g., robotics path planning, image annotation, video content analysis, and so on. The

Manuscript received November 10, 2013; revised June 22, 2014; accepted June 24, 2014. Date of publication September 4, 2014; date of current version July 15, 2015. This work was supported in part by the Singapore National Research Foundation through the International Research Centre, Singapore Funding Initiative, within the Interactive Digital Media Programme Office, in part by the Natural Science Foundation of China under Grant 61125106, Grant 61373076, and Grant 61002009, in part by the Fundamental Research Funds for the Central Universities under Grant 2013121026, and in part by the 985 Project, Xiamen University, Xiamen, China. (Corresponding author: R. Ji) L. Zhang and Y. Zhang are with the School of Computing, National University of Singapore, Singapore 119077. R. Ji is with the Department of Cognitive Science, School of Information Science and Engineering, Xiamen University, Xiamen 361000, China (e-mail: [email protected]). Y. Xia is with the Department of Computer Science, Zhejiang University, Hangzhou 310027, China. X. Li is with the State Key Laboratory of Transient Optics and Photonics, Center for Optical Imagery Analysis and Learning, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2014.2347398

Fig. 1. Each spatial structure among scene components captures a discriminative cue and corresponds to a weak classifier; the weak classifiers are further integrated into a strong classifier by boosting.

challenge of scene categorization lies in the wide variety of potential scene configurations. In particular, it is very challenging to extract discriminative visual descriptors due to the large number of components within a scene and their varied spatial configurations, which makes it hard to design a statistical feature representation model, as shown in Fig. 1. A promising alternative is to discover the topological structures of the scene image, as recently advocated by state-of-the-art works in scene categorization [3], [6], [9]. Constructing the topological image structure involves modeling the spatial co-occurrence and interaction among image patches. There are two representative alternatives: graph matching-based schemes and spatial pyramid-based schemes.

A. Graph Matching-Based Schemes

Graph matching-based approaches depict an image using a graphical model: each vertex represents a scene component, and spatially neighboring components are connected by an edge. Thereby, the complicated spatial configuration of each scene image can be interpreted as the topology of the graph. Harchaoui and Bach [6] proposed a kernel that captures the walk structure among local image patches, using a finite sequence of neighboring regions. Unfortunately,

2162-237X © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

ZHANG et al.: LEARNING A PROBABILISTIC TOPOLOGY DISCOVERING MODEL

the totter phenomenon [7] unavoidably brings noise into the walk structure, which limits its discriminability. To obtain better discriminative power, parameters are provided in [6] to tune the walk length, resulting in a redundant collection of walk patterns. Keselman and Dickinson [8] presented an approach to acquire the lowest common abstraction (LCA) among a set of images, i.e., common subgraphs among the graphs corresponding to a set of images. However, this approach is data set dependent and cannot recognize new images. Further, experimental results indicate that the resulting LCA becomes unstable when there are more than two input images. Demirci et al. [2] formulated object recognition as many-to-many feature matching between graphs corresponding to pairwise images. The limitation of Demirci et al.'s approach is its sensitivity to clutters and occlusions. Felzenszwalb and Huttenlocher [3] modeled the connection between different parts of an object as a spring, and the mismatching cost of object parts is defined accordingly. Based on this, the matching between objects is computed by minimizing the mismatching cost. However, this model relies heavily on an optimal background subtraction. In [9], the vertices of the graph represent both known and unknown objects, and prior knowledge is employed to predict the unknown objects using the known ones. However, only knowledge of spatially neighboring segments is exploited in [9]; their topology and relative displacement, two other important cues for semantic inference, are not considered. Zhang et al. [36] proposed a categorization framework that integrates connected subgraphs from pairwise aerial images into a kernel [5], [29], which is then fed into a support vector machine (SVM). Notably, subgraphs with different topologies are combined into the kernel, thereby potentially decreasing its descriptiveness.
Even worse, the subgraph’s discrimination and redundancy are evaluated according to a discriminative model. In practice, a probabilistic discrimination or redundancy measure is more accurate, as each topology is neither completely correlated nor noncorrelated with a category. Duchenne et al. [43] proposed a graph matching kernel for object categorization, the vertices of which correspond to a set of image grids. The edges reflect the grid structure and function as springs to preserve the geometric property of spatially neighboring grids. However, the grid cannot represent basic scene image components. Wang et al. [44] proposed using a hierarchical connection graph to detect and extract gable roofs from aerial imagery, based on a self-avoiding polygon (SAP) model. The SAP model is a deformable shape model that can represent gable roofs of various shapes and appearances. Porway et al. [45] proposed a hierarchical and contextual model for aerial image understanding, wherein different components in an aerial image (e.g., cars and roofs) are organized into hierarchical graphs whose appearances and configurations are determined by statistical constraints, such as relative position and relative scale. Lin et al. [46] presented an object categorization framework based on sketch graphs, wherein shape and structure cues are exploited. The key concept underlying this framework is a learnable AND–OR graph model that hierarchically combines the reconfigurability of a stochastic


context-free grammar. Lin et al. [47] further proposed a hierarchical generative model for recognizing compositional object categories. Specifically, objects are decomposed into different parts, and the relationships between the parts are modeled by stochastic attribute graph grammars, which are embedded in an AND–OR graph for each compositional object category. Further, Zhang et al. [48] proposed to measure the similarity between aerial images by enumeratively matching their respective graphlets. Unfortunately, in practice, this strategy is computationally intractable when the number of basic aerial image components is large.

B. Multilayer Spatial Pyramid-Based Schemes

Although graph-based image representations model the image topological property explicitly, many of them are either designed for a specific data set or rely heavily on an optimal preprocessing. Toward a general and robust image representation, a multilayer spatial pyramid is employed to incorporate rough geometric property into a categorization model. Hadjidemetriou et al. [11] proposed the multiresolution histogram (MRH), wherein images at different resolutions are generated using a Gaussian filter and a global image histogram is computed at each level. Then, the differences between the histograms of consecutive levels are computed and concatenated to describe the rough geometry of an image. Lazebnik et al. [10] developed a spatial pyramid matching (SPM) model by partitioning an image into increasingly fine grids and computing histograms of local features inside each grid cell. However, experimental results show that good performance is achieved only when SPM works together with a nonlinear SVM, which puts SPM at risk of a high computational complexity: O(N²)–O(N³) in the training stage. Toward a more efficient categorization model, Yang et al. [32] proposed sparse coding spatial pyramid matching (SC-SPM), which encodes image local descriptors by sparse coding [34].
Wang et al. [28] proposed locality-constrained linear coding spatial pyramid matching (LLC-SPM), which improves conventional SPM by utilizing locality constraints to encode each local descriptor. Further, in super-vector spatial pyramid matching (SV-SPM), Zhou et al. [33] proposed a super-vector encoding of SIFT descriptors, which extends traditional vector quantization coding; the encoded SIFT descriptors are then concatenated into an image-level representation. The experimental results show that the approaches of Yang et al. [32] and Wang et al. [28] perform well under a linear SVM. It is worth emphasizing that the aforementioned SPM models rely completely on low-level cues and reflect no semantics. To integrate high-level cues, Li et al. [35] proposed object-bank-based SPM (OB-SPM), wherein an image is described by a scale-invariant response map of a large number of prespecified generic object detectors, and excellent visual recognition performance is achieved. Two limitations of the spatial pyramid-based approaches are worth emphasizing. 1) A single grid is not discriminative enough; some grids, which may be noisy due to occlusion and background variations, affect the recognition accuracy. 2) Spatial relations among grids are ignored.
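The per-cell pooling that these SPM variants share can be sketched as follows. This is an illustrative toy version, not the implementation of [10]; the function name and the quantized `(x, y, word_id)` feature format are our assumptions.

```python
from collections import Counter

def spatial_pyramid_histogram(features, img_w, img_h, vocab_size, levels=2):
    """Concatenate per-cell visual-word histograms over a spatial pyramid.

    features: list of (x, y, word_id) quantized local descriptors.
    Returns one flat, L1-normalized histogram covering all pyramid levels.
    """
    hist = []
    for level in range(levels):
        cells = 2 ** level                      # cells per axis at this level
        counts = [Counter() for _ in range(cells * cells)]
        for x, y, w in features:
            cx = min(int(x * cells / img_w), cells - 1)
            cy = min(int(y * cells / img_h), cells - 1)
            counts[cy * cells + cx][w] += 1     # accumulate word in its cell
        for c in counts:                        # fixed-length cell histograms
            hist.extend(c.get(w, 0) for w in range(vocab_size))
    total = sum(hist) or 1
    return [h / total for h in hist]
```

With two levels the vector has `vocab_size * (1 + 4)` entries, which is why SPM descriptors grow quickly with pyramid depth.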


Fig. 2. Pipeline of the proposed approach. (a) Graphlets from each scene image. (b) Obtain representative graphlets. (c) Graphlets boosting.

Besides the above two lines of research, Song and Tao [57] studied the manifold constructed by a set of biologically inspired features (BIFs) utilized for scene classification. The key is a new dimensionality reduction algorithm that preserves both the intra-BIF geometry and the inter-BIF discriminative information. Huang et al. [58] solved two problems of conventional biologically inspired models: mismatch caused by dense input and random feature selection due to the feedforward framework. The authors developed an enhanced biologically inspired model that contains two components: 1) removing uninformative inputs by imposing sparsity constraints and 2) applying a feedback loop to middle-level feature selection. Each component is motivated by relevant psychophysical research findings. Further, they applied the model to object categorization and conducted empirical studies on four computer vision data sets. Zhou and Tao [59] proposed double shrinking to compress image data in both dimensionality and cardinality by building either sparse low-dimensional representations or a sparse projection matrix for dimension reduction. They formulated the double shrinking model as regularized variance maximization. Liu and Tao [60] proposed multiview Hessian regularization to solve two problems in image annotation: most existing methods are based on Laplacian regularization, which lacks extrapolating power, particularly when the number of labeled examples is small; in addition, conventional methods are often designed for a single view and are thus not applicable to practical multiview applications. The proposed method combines multiple kernels and Hessian regularizations obtained from different views to boost learning performance.

To address the above problems, this paper focuses on mining visual patterns from local regions in a probabilistic framework, which are further boosted together for scene categorization.
As shown in the pipeline of Fig. 2, each scene image is first segmented into a set of atomic regions, each representing a basic scene component. To model the spatial relations among these atomic regions, a region adjacency graph (RAG) is constructed; accordingly, scene categorization can be formulated as RAG-to-RAG matching [Fig. 2(a)]. To measure the similarity between RAGs, it is straightforward to compare all their respective small-sized connected subgraphs, also known as graphlets [40]. Unfortunately, the number of graphlets in an RAG is huge according to graph theory, making enumerative graphlet-to-graphlet comparison computationally intractable. Toward an efficient measure, it is necessary to select a few discriminative graphlets for comparison. As the number of candidate graphlets is huge, a probabilistic model is proposed to select a few highly discriminative and low-redundant topologies, which are further used to extract the corresponding graphlets [Fig. 2(b)]. Intuitively, each topology captures the discrimination of a scene image from a probabilistic perspective (e.g., the star-shaped topology is representative of the Intersection bridge category). To aggregate the discrimination from multiple topologies, we train multiple weak learners, each corresponding to a topology. These weak learners are further integrated using a boosting strategy for robust scene categorization, as shown in Fig. 2(c).

The contributions of this paper can be summarized as: 1) a new scene categorization model that explicitly encodes scene topologies into a boosting framework to enhance recognition; 2) a probabilistic topology selection algorithm that obtains highly discriminative and low-redundant graphlets to represent scene categories; and 3) a boosting strategy that combines weak learners from the selected topologies into a strong one.

The remainder of this paper is organized as follows. Section II introduces the proposed method, including probabilistic topology selection, depth-first-search (DFS)-based graphlet extraction, and topology boosting. Experimental results in Section III thoroughly demonstrate the effectiveness and efficiency of our model. Section IV concludes this paper and suggests future work.

Fig. 3. Scene image and its corresponding RAG.

Fig. 4. Illustration of the basic concepts in our model. (a) Original scene image. (b) Graphlet. (c) Topology.

II. PROPOSED APPROACH

A. Basic Concepts of Topological Descriptors

A scene image usually contains millions of pixels. If we treated each pixel as a local feature, image matching would involve enumerating all potential combinations between pairwise local feature sets, which is computationally intractable. A more practical solution is to first segment images into geometrically and semantically consistent regions, upon which modeling and matching are built. This ensures high efficiency as well as consistency with human perception, i.e., pixels with similar appearances are perceived as one unit. Thus, we use an RAG to represent a scene image by a set of segmented regions associated with their spatial interactions. In our approach, we choose unsupervised fuzzy clustering (UFC) to segment images. Toward a coarse-to-fine scene representation, each image is segmented five times. As shown in Fig. 3, given a scene image I, we segment it into a set of regions. An RAG G is constructed to model the scene image I, that is

    G = (V, E, H)    (1)

where V is a finite set of vertices, each representing a segmented region; h : V → H is a function assigning a label to each vertex, i.e., h(v) is a row vector representing the appearance of the region corresponding to v. In this paper, the appearance is characterized by a combination of a 9-D color moment [26] and a 128-D histogram of gradient [27]. E is a set of edges, each connecting pairwise spatially neighboring vertices produced from the same segmentation.

We introduce three important concepts related to the afore-defined RAG, all of which are basic concepts in graph theory [37]. As in Figs. 3 and 4, we call a graph S a graphlet of RAG G if S is a connected subgraph of G. Pairwise graphlets S and S′ are isomorphic (denoted by S ≅ S′) if there exists a bijection ϕ : V → V′ such that for each u, v ∈ V, (u, v) ∈ E if and only if (ϕ(u), ϕ(v)) ∈ E′, and h(u) = h′(ϕ(u)). If S ≅ S′ and S′ ⊆ G, we call S subgraph isomorphic to G, or G supergraph isomorphic to S (denoted by S ⊑ G).

B. Mining Representative Topologies

Based on the concept of the RAG, the similarity between pairwise scene images can be intuitively formulated as the

Fig. 5. Many-to-one mapping from graphlets to their corresponding topology, i.e., g(S) = T . The four graphlets (right) all satisfy the topology (left).

similarity between their corresponding RAGs. To measure the similarity between RAGs, graph theory [37] suggests enumeratively comparing all their pairwise graphlets. However, enumerative comparison is intractable for two reasons. First, the number of graphlets in an RAG is huge, i.e., O(A^M), where A is the average vertex degree (typically 5) and M is the number of segmented regions in an image (typically larger than 50). Second, nondiscriminative and redundant graphlets make no contribution to scene categorization. It is therefore necessary to select a fraction of highly discriminative and low-redundant graphlets in our scene categorization framework.

To select such graphlets, a probabilistic feature selection algorithm is developed that contains three stages: 1) to reduce the number of candidates for selection, we shrink the graphlet space to the topology space; 2) highly discriminative and low-redundant topologies are selected first; and 3) the top-selected topologies guide the subsequent graphlet selection.

As aforementioned, we shrink the large number of graphlets to a relatively small number of topologies to reduce the number of candidates for selection. In this paper, the topologies are obtained by discretizing the continuous graphlet vertices. As shown in Fig. 5, to map a graphlet to a topology, a clustering strategy is adopted. That is, a codebook H_D is generated by k-means [14], [20], [21] clustering on the vertex labels. Then, the continuous label h(v) of vertex v is discretized into the nearest code

    h_D(v) = arg min_{h ∈ H_D} ‖h(v) − h‖₂.    (2)
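The nearest-code assignment of (2) is straightforward to sketch; the codebook is assumed to have been produced by k-means on the vertex labels, and the function name is ours.

```python
def discretize(h_v, codebook):
    """h_D(v) = arg min_{h in H_D} ||h(v) - h||_2  (Eq. 2).

    h_v: continuous vertex label (a sequence of floats);
    codebook: list of codes H_D. Returns the index of the nearest code.
    """
    def sq_dist(a, b):
        # squared Euclidean distance; the minimizer is the same as for the norm
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(h_v, codebook[i]))
```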

Based on the discretizing operation, given a graphlet S, its corresponding topology can be deemed a graphlet with discretized vertices

    T = (V, E, H_D).    (3)

In this paper, we denote the mapping from a graphlet S to its topology T as a function g : S → T. As shown in Fig. 5, a topology can be deemed a vertex-discretized graphlet, which means that the number of candidate topologies for selection is much smaller than that of candidate graphlets. Thus, it is feasible to select a few representative topologies for scene categorization. Before selecting the representative topologies, we need to measure a topology's discrimination, i.e., how accurately a topology predicts the class labels of scene images. As vertex-discretized graphlets, topologies describe the spatial relations of local features, and there are usually multiple graphlets in a scene image satisfying topology T. Therefore, to derive the discrimination of a topology, it is necessary to find the graphlets in G satisfying topology T; in this paper, these graphlets are denoted by

    G(T) = {S | S ⊆ G ∧ g(S) = T}.    (4)
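A hedged sketch of the mapping g and of collecting G(T) per (4). For brevity, a graphlet is represented as a pair (vertex labels, edge set) and topologies are compared exactly on the discretized form; the real model matches up to graph isomorphism, which this toy version ignores.

```python
def topology_of(graphlet, codebook):
    """g(S) = T: discretize each continuous vertex label to its nearest
    code while keeping the edge structure (cf. Eqs. 2-4)."""
    labels, edges = graphlet
    def nearest(label):
        return min(range(len(codebook)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(label, codebook[i])))
    # a hashable topology key: discretized vertex codes plus the edge set
    return (tuple(nearest(l) for l in labels), frozenset(edges))

def group_by_topology(graphlets, codebook):
    """Collect G(T) = {S : g(S) = T} for every topology T seen (Eq. 4)."""
    groups = {}
    for S in graphlets:
        groups.setdefault(topology_of(S, codebook), []).append(S)
    return groups
```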

Intuitively, each graphlet S ∈ G(T) can be represented as a vector h(S) that is obtained by concatenating the appearance feature vectors of all its constituent vertices

    h(S) = ∪_{v∈S} [h(v)]    (5)

where ∪[·] is a row-wise vector concatenation operator and the appearance vector is a 9-D color moment [26] plus a 128-D HOG [27]. The discrimination of a topology reflects how confidently a topology can predict the label of a scene image. Given a set of RAGs G = {G_1, . . . , G_N} obtained from the training images and a topology T, we obtain a set of graphlets and further transform them into feature vectors. Thereafter, an SVM classifier [14] C is trained based on {H, K}, where H is the set of appearance feature vectors from the training data, and K is the set of class labels corresponding to the RAGs in G. Thereafter, the class label of graphlet S is calculated from the posterior probability P(G → k|S) output by C¹

    S → arg max_k P(G → k|S).    (6)

As shown in Fig. 5, there are typically multiple graphlets in G satisfying topology T. To combine the class labels predicted by all these graphlets, the label of topology T is derived from a multiple-classifier combination strategy [23]. That is, the posterior probability of topology T belonging to class k is calculated as

    P(G → k|G(T)) = (1 − Z)P(G → k) + Σ_{i=1}^{Z} P(G → k|S_i)    (7)

¹As a scene image and its RAG are in one-to-one correspondence, we do not distinguish them for ease of expression. Thus, P(G → k) means the probability of RAG G's corresponding scene image belonging to the kth category.

where Z is the number of graphlets satisfying topology T, and P(G → k) is the probability of RAG G belonging to class k, which is computed from the training data. Based on the above derivation, the confidence of topology T in predicting its class label is the maximum probability of it belonging to any class k

    cf(T) = max_k P(G → k|G(T)).    (8)
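Equations (7) and (8) amount to a sum-rule fusion of the per-graphlet SVM posteriors. A minimal sketch, assuming the class priors P(G → k) and the posteriors P(G → k|S_i) have already been estimated:

```python
def topology_posterior(priors, graphlet_posteriors):
    """P(G -> k | G(T)) = (1 - Z) P(G -> k) + sum_i P(G -> k | S_i)  (Eq. 7),
    where Z is the number of graphlets satisfying topology T."""
    Z = len(graphlet_posteriors)
    return [(1 - Z) * priors[k] + sum(p[k] for p in graphlet_posteriors)
            for k in range(len(priors))]

def confidence(priors, graphlet_posteriors):
    """cf(T) = max_k P(G -> k | G(T))  (Eq. 8)."""
    return max(topology_posterior(priors, graphlet_posteriors))
```

Note that the fused scores still sum to one whenever the priors and each graphlet posterior do, since (1 − Z)·1 + Z·1 = 1.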

C. Highly Discriminative and Low-Redundant Topologies

In the extreme case, a topology T is optimal if ∃k ∈ {1, 2, . . . , K} such that the following three conditions are satisfied:

    C1: P(G(T) → k | G → k) = 1
    C2: P(G → k | G(T) → k) = 1
    C3: P(G(T) → k | G(T′) → k) = 0

where G → k means a scene image (or its RAG) belongs to the kth class, and G(T) → k means topology T belongs to the kth class. Therefore, C1 maximizes the descriptive ability of topology T, C2 maximizes the discriminative ability of topology T, and C3 means that pairwise topologies T and T′ are noncorrelated in predicting class labels. However, as proved in [14], in the case of noisy training data, such an optimal topology may not always exist. Therefore, it is necessary to search for a set of suboptimal topologies, i.e., ∃k ∈ {1, 2, . . . , K} such that

    C4: P(G(T) → k) ≥ min(P(G → k))
    C5: P(G → k | G(T) → k) ≥ α · P(G → k)
    C6: P(G(T) → k | G(T′) → k) < β.
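Conditions C5 and C6 suggest a simple two-stage filter over candidate topologies: discard those whose discrimination falls below α, then prune redundant pairs above β, keeping the more discriminative member. A hedged sketch, assuming per-topology discrimination scores `disc` and a pairwise redundancy function `redn` have already been estimated (both quantities are made precise later in this section):

```python
def select_topologies(candidates, disc, redn, alpha, beta):
    """Keep discriminative topologies (disc >= alpha), then drop the
    lower-scoring member of any pair whose redundancy exceeds beta.

    candidates: list of topology ids; disc: dict id -> score;
    redn: f(t1, t2) -> pairwise redundancy.
    """
    # stage 1: discrimination threshold (cf. C5), best candidates first
    kept = sorted((t for t in candidates if disc[t] >= alpha),
                  key=lambda t: disc[t], reverse=True)
    # stage 2: greedy redundancy pruning (cf. C6)
    refined = []
    for t in kept:
        if all(redn(t, r) <= beta for r in refined):
            refined.append(t)
    return refined
```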

Condition C4 means that the frequency of topology T belonging to class k should be larger than min(P(G → k)). To satisfy this requirement, the frequency of topology T is computed by counting how many RAGs in G are supergraph isomorphic [30], [37] to topology T

    P(G(T)) = |{G ∈ G : T ⊑ G}| / N.    (9)

Straightforwardly, the frequency of a topology belonging to class k ∈ {1, 2, . . . , K} is computed by

    P(G(T) → k) = |{G ∈ G : T ⊑ G ∧ G → k}| / N.    (10)

For a topology T, a larger P(G(T) → k) means T has a higher generalization ability toward class k. In our approach, an efficient frequent subgraph mining algorithm (FSG) [22] is employed to output the topologies with P(G(T) → k) ≥ min(P(G → k)). The key advantage of FSG is that it scales reasonably well to very large graph data sets, provided the graphs contain sufficiently many different edge and vertex labels. In addition, an optimized Linux program is publicly available.² In our experiments, we found that FSG typically takes 8.17 min to discover all the frequent topologies when run on the more than 4000 20-sized RAGs of Scene 15 [10].

2 http://glaros.dtc.umn.edu/gkhome/pafi/overview
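The counts behind (9) and (10) can be sketched directly. FSG itself is far more sophisticated; here a black-box supergraph-isomorphism test `contains_T` is assumed in its place.

```python
def topology_frequency(rags, contains_T, k=None):
    """P(G(T)) of Eq. (9), or P(G(T) -> k) of Eq. (10) when a class k is
    given: the fraction of training RAGs supergraph isomorphic to T.

    rags: list of (rag, class_label) pairs;
    contains_T: rag -> bool, an assumed subgraph-isomorphism test for T.
    """
    hits = sum(1 for g, label in rags
               if contains_T(g) and (k is None or label == k))
    return hits / len(rags)
```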

Fig. 6. Graphical illustration of extracting graphlets based on each of the mined topologies.
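The extraction illustrated in Fig. 6 and detailed in Section II-D below can be viewed as a backtracking depth-first search for label- and edge-consistent embeddings of a topology in an RAG. A minimal illustrative version (ours, not the authors' implementation), assuming discretized vertex codes on both sides:

```python
def extract_graphlets(rag_adj, rag_codes, topo_adj, topo_codes):
    """Find embeddings of topology T in RAG G by backtracking DFS.

    rag_adj / topo_adj: dict vertex -> set of neighbours;
    rag_codes / topo_codes: dict vertex -> discretized label code.
    Returns dicts mapping topology vertices to RAG vertices such that
    labels match and every topology edge maps onto an RAG edge.
    """
    topo_vs = list(topo_adj)
    results = []

    def dfs(i, mapping):
        if i == len(topo_vs):
            results.append(dict(mapping))
            return
        t = topo_vs[i]
        for g in rag_adj:
            if g in mapping.values() or rag_codes[g] != topo_codes[t]:
                continue
            # every already-mapped topology neighbour must be an RAG neighbour
            if all(mapping[n] in rag_adj[g]
                   for n in topo_adj[t] if n in mapping):
                mapping[t] = g
                dfs(i + 1, mapping)
                del mapping[t]

    if len(topo_vs) <= len(rag_adj):   # step 1 of Section II-D
        dfs(0, {})
    return results
```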

To satisfy C5, inspired by linear discriminant analysis [12], the discrimination of topology T is defined as its largest discrimination toward any class k ∈ {1, 2, . . . , K}

    disc(T) = max_k P(G → k|G(T)) / P(G → k)    (11)

where the denominator reflects how many RAGs come from the kth class and functions as a normalization factor, and the numerator is computed based on (7). A topology with disc(T) < α is regarded as less discriminative.

To satisfy rule C6, given a pair of topologies T and T′, their redundancy is computed as

    redn(T, T′) = P(G(T) → k | G(T′) → k)    (12)

where the probability P(G(T) → k | G(T′) → k) is computed by counting the number of RAGs that are supergraph isomorphic to both T and T′

    P(G(T) → k | G(T′) → k) = |{G ∈ G : T ⊑ G ∧ T′ ⊑ G ∧ G → k}| / N.    (13)

By combining the three conditions C4, C5, and C6, we present the highly discriminative and low-redundant topology selection procedure in Algorithm 1. Given N training RAGs and A candidate topologies, and assuming that the structural distance between pairwise RAGs can be computed in constant time, computing disc and redn each requires the distance between every pair of RAGs, so each costs O(N²). As shown in Algorithm 1, the topology selection contains a double loop, each pass of which costs O(AN²). Therefore, the time complexity of Algorithm 1 is O(N²A²) in total.

Algorithm 1 Highly Discriminative and Low-Redundant Topology Selection
input: Training data D = {I_i, k_i}_{i=1}^{N}; thresholds α, β;
output: A set of refined topologies L;
1. For each scene image I_i in D, obtain the corresponding RAG and save it into G;
2. Conduct FSG on G to preserve in L the topologies T with P(G(T) → k) ≥ min(P(G → k));
3. for each topology T ∈ L: if disc(T) < α, then L ← L \ T; end for
4. For each pair T, T′ ∈ L, compute redn(T, T′) based on (12); if redn(T, T′) > β, then remove the topology with the smaller disc value.
Return L;

D. DFS-Based Graphlet Extraction

As the selected topologies are both highly discriminative and of low redundancy, the next step is to extract the corresponding graphlets from them. In this paper, a DFS-based algorithm is developed to find a topology's corresponding graphlets in an RAG. The advantage of DFS is that it allows all vertices in an RAG to be visited nonrepetitively in linear time, as the average RAG vertex degree is small. As shown in Fig. 6, the proposed graphlet extraction algorithm contains three steps. First, we check whether topology T contains fewer vertices than RAG G. If so, an iterative process is carried out to extract the corresponding graphlets; otherwise, the algorithm terminates. Second, for each vertex in RAG G, we treat it as the reference point and match topology T against RAG G using a DFS strategy; only graphlets whose topology matches T are qualified. By traversing all vertices in G, we perform the matching process and collect the qualified graphlets iteratively. Finally, a collection of graphlets is obtained by the proposed algorithm. Noticeably, RAGs are graphs with low vertex degree; thus, the computational cost increases approximately linearly with the number of vertices.

E. Boosting the Selected Topologies

To integrate the extracted highly discriminative and low-redundant topologies for scene categorization, a boosting strategy is developed. In detail, for each topology T ∈ L, a weak linear SVM classifier C is trained that captures the discrimination of the scene spatial configuration in one aspect. To further combine the discrimination from all aspects, inspired by stagewise additive modeling using a multiclass exponential loss (SAMME) [25], which performs K-class (K ≥ 2) boosting robustly, we develop a multiclass boosting algorithm to integrate the λ (λ = |L|) weak classifiers {C_i}_{i=1}^{λ} into a strong one C. The key advantage of SAMME is that, compared with AdaBoost.MH [25], which performs K one-against-all classifications, SAMME performs K-class classification

directly. That is, SAMME only needs weak classifiers better than random guess, i.e., with correct probability larger than 1/K, rather than better than 1/2 as binary AdaBoost requires. In particular, given the training data {h_j, k_j}_{j=1}^{N_tr}, wherein h_j is the jth training graphlet and k_j is the semantic label of h_j, SAMME tries to find a regression function C(h) = [c_1(h), . . . , c_K(h)]^T that minimizes the objective function

    Σ_{i=1}^{N_tr} L(y_i, C(h_i))  s.t.  c_1 + · · · + c_K = 0    (14)

where y = [y_1, . . . , y_K]^T is a K-dimensional vector corresponding to the output of C

    y_k = 1,            if C(h) = k
    y_k = −1/(K − 1),   if C(h) ≠ k    (15)

L(y, C(h)) = exp(−(1/K)(y_1 c_1 + · · · + y_K c_K)) = exp(−(1/K) y^T C(h)) is the loss function, and we impose c_1 + · · · + c_K = 0 to guarantee a unique solution.

The flowchart of the topology-based boosting is presented in Algorithm 2. In the tth iteration, the classifier with minimum weighted error is selected from the λ classifiers {C_i}_{i=1}^{λ}. In addition, the term log(K − 1) is used to ensure that each selected weak classifier is more accurate than 1/K, because a^t > 0 requires err^t < 1 − 1/K. Given |L| weak classifiers, N training RAGs, and R boosting iterations: step 1 has time complexity O(N); step 2(a) has time complexity O(|L|), and the remaining steps 2(b)–(e) are all O(1); and the final step has time complexity O(R). Therefore, the time complexity of Algorithm 2 is O(R|L|) in total.

Algorithm 2 Discriminative Graphlet Boosting
input: A set of training RAGs and their labels {G_j, k_j}_{j=1}^{N}; a set of weak classifiers {C_i}_{i=1}^{|L|}; the number of boosting iterations R;
output: A strong classifier C(G);
1. Set the RAG weights w_j = 1/N, j = 1, 2, . . . , N;
2. for t = 1, 2, . . . , R
   (a) Select the weak classifier C^(t) = arg min_{C ∈ {C_i}} Σ_{j=1}^{N} w_j · 1(C(G_j) ≠ k_j);
   (b) Compute the weighted training error: err^t = Σ_{j=1}^{N} w_j · 1(C^(t)(G_j) ≠ k_j) / Σ_{j=1}^{N} w_j;
   (c) a^t ← log((1 − err^t)/err^t) + log(K − 1);
   (d) Update the training RAG weights: w_j ← w_j · exp[a^t · 1(C^(t)(G_j) ≠ k_j)];
   (e) Renormalize w_j;
   end for
Return C(G) = arg max_k Σ_{t=1}^{R} a^t · 1(C^(t)(G) = k), where 1(·) is the indicator function;

TABLE I
STATISTICS OF THE EXPERIMENTAL DATA SETS

III. EXPERIMENTAL RESULTS AND ANALYSIS

In this section, we justify the effectiveness of the proposed graphlet-guided scene categorization model based on two sets of experiments. The first set compares the proposed approach with well-known object/scene classification models. The second set evaluates the performance of our approach under different parameters. To demonstrate the advantage of the proposed method, we experiment on five data sets: Scene 15 [10], Scene 67 [24], Caltech 256 [15], PASCAL VOC 2009 [16], and LHI [17]. Table I shows the details of the five data sets, and some sample images are given in Fig. 7.

A. Comparison With the State-of-the-Art

We compare the proposed method with five representative generic object recognition models: the fixed-length walk kernel (FLWK) [6], the fixed-length tree kernel (FLTK) [6], MRH [11], the SPM kernel (SPMK) [10], and region-based hierarchical image matching (RHIM) [1]. In addition, four SPM variants, LLC-SPM, SV-SPM, SC-SPM, and OB-SPM, are also compared. The experimental settings are as follows: the lengths of FLWK [6] and FLTK [6] are tuned from 2 to 10; for MRH [11], we smooth images with radial basis function kernels of 15 gray levels; for SPMK [10] and the SPM variants, each image is decomposed into over 1 million SIFT [13] features of 16 × 16 pixel patches computed over a grid with a spacing of 8 pixels, and a codebook of size 400 is then generated by k-means [14]; for the proposed method, the number of multiple segmentations, max(L), is tuned from 2 to 7, and the iteration number of boosting, R, is set to 300. In Fig. 8, we compare the average recognition accuracy of the compared methods, and the per-category classification accuracy on PASCAL VOC 2009 is given in Table IV. Our approach outperforms its competitors owing to the following reasons. 1) Since SPMK, SC-SPM, SV-SPM, and LLC-SPM are based on SIFT descriptors only, it is difficult for them to simultaneously incorporate multiple types of low-level visual descriptors; they thus ignore other important cues such as color distribution.

Fig. 7. Example images from the five experimental data sets.

2) The object detectors in OB-SPM are trained for generic objects; thus, they cannot effectively detect basic image components. 3) All five SPM-based recognition models incorporate no discriminative spatial information. 4) The revised hierarchical model and RHIM capture only rough structural information of an image, and thus they perform quite poorly. 5) Compared with the proposed approach, both the walk and the tree kernels are less representative graph-based descriptors, owing to the inherent tottering phenomenon [7].

We further compare our approach with more recent categorization models. Dixit et al. [52] presented a general formulation of Bayesian adaptation that targets class adaptation and is applicable to both the generative and the discriminative strategies for image classification. Kobayashi [53] proposed a novel method to extract effective features for image classification: within the bag-of-features framework, which extracts plenty of local descriptors from an image, the method is built upon the probability density function (pdf) obtained by applying a kernel density estimator to those local descriptors; it exploits the oriented pdf gradients to effectively characterize the pdf, which are subsequently coded and aggregated into orientation histograms. Zhang et al. [54] proposed a discriminative structured low-rank framework for image classification, in which label information from the training data is incorporated into the dictionary learning process by adding an ideal-code regularization term to the objective function of dictionary learning. Russakovsky et al. [55] proposed an object-centric spatial pooling framework for improving classification performance, which focuses on training reliable object detectors with no bounding box annotations available, as in typical classification settings. Further, Wu and Rehg [56] proposed the census transform histogram (CENTRIST) for scene recognition. They analyzed the peculiarity of place images and listed a few properties that are desirable for a place/scene recognition representation. Afterward, they showed how CENTRIST satisfies these properties better than competing visual descriptors, such as SIFT, HOG, and GIST, and demonstrated that CENTRIST has several advantages over state-of-the-art feature descriptors for place/scene categorization.

In Table II, we compare the proposed method with the five state-of-the-art categorization models described above, following the experimental settings described in their publications. As can be seen, on all four experimental data sets, our approach outperforms its competitors in most cases. When only 20 topologies are selected, the performance of our approach is worse than those published very recently. To demonstrate the potential of the proposed model, we selected 50 topologies for each data set and present the comparative categorization accuracies in Table III, where we compare our approach with the models proposed in [49]–[51]. As shown in Table III, when selecting 50 discriminative topologies, our approach outperforms the three competitors on the Scene 15 data set. On the Scene 67 data set, our approach lags behind Juneja et al.'s [51] approach by nearly 2%, but still outperforms Krapac et al.'s [49] and Xiao et al.'s [50] results. The above results clearly confirm the advantage of the proposed categorization model, and we believe that when more topologies are selected (e.g., 100 or 150), the categorization accuracy of our model will come closer to Juneja et al.'s [51] performance on the Scene 67 data set.

B. Detailed Analysis of Categorization Performance

We first present the per-category accuracy on the PASCAL VOC 09 data set. The codebook sizes of locality-constrained linear coding (LLC), sparse vector (SV), and sparse coding (SC) are all fixed to 128.

TABLE II
COMPARISON OF STATE-OF-THE-ART CATEGORIZATION ALGORITHMS ON THE FOUR DATA SETS

TABLE III
COMPARATIVE CATEGORIZATION ACCURACIES ON THE SCENE 15 AND THE SCENE 67 (THE RED COLORED PERCENTAGES ARE COMPUTED BASED ON OUR IMPLEMENTATION; THE BLUE COLORED PERCENTAGES ARE RESULTS PRESENTED IN THEIR PAPERS)

Fig. 9. Categorization accuracy under different codebook sizes or numbers of selected topologies.

Fig. 8. Categorization accuracy of the compared methods on, from top to bottom, Scene 15, Scene 67, Caltech 256, and PASCAL VOC 2009, wherein the experiment on each data set is repeated 10 times.

The number of object detectors in OB-SPM is 200, as described in the corresponding publication. The number of selected topologies is 20 for our model. As shown in Table V, our approach beats its competitors on most categories, which clearly confirms the advantage of the proposed method. Then, we show the influence of the codebook size/the number of selected topologies on the categorization performance. We experiment on LLC-SPM, SV-SPM, SC-SPM, and

the proposed method. As shown in Fig. 9, for LLC-SPM, SV-SPM, and SC-SPM, the categorization performance increases stably as their codebook sizes are tuned from 128 to 2048. For the proposed method, however, the categorization accuracy stops increasing when the number of selected topologies reaches 256; this is because there are only 274 topologies on the PASCAL VOC 09 in total. Finally, we report the time consumption of codebook training and encoding, respectively. As shown in Table VI, both the codebook training and the encoding time consumption are tolerable compared with the other three algorithms. It is worth emphasizing that the time consumption could be reduced further by optimizing the MATLAB code of our implementation.

C. Influence of Different Parameters

For the proposed method, the influence of the segmentation operation in the construction of the RAG is nonnegligible. To evaluate scene categorization under different segmentation settings, based on (11), we report the frequent topology's measure of discrimination under benchmark segmentation, deficient segmentation, and oversegmentation (the output of step 2 of Algorithm 1). We experiment on the


TABLE IV P ER -C ATEGORY A CCURACY OF THE 20 C ATEGORIES OF PASCAL VOC 2009 (%)

TABLE V C OMPARISON OF P ER -C ATEGORY A CCURACY ON THE PASCAL VOC 09 D ATA S ET (%)

TABLE VI T IME C ONSUMPTION OF C ODEBOOK T RAINING AND E NCODING (F OR THE P ROPOSED M ETHOD , C ODEBOOK T RAINING M EANS T OPOLOGIES S ELECTION , AND THE E NCODING D ENOTES C ALCULATING THE I MAGE R EPRESENTATION BASED ON THE S ELECTED T OPOLOGIES )

Fig. 10. Topology's disc value under three different segmentation settings.

PASCAL VOC 2009 [16] because its segmentation benchmark enables a precise comparison. As shown in Fig. 10, topologies from the benchmark segmentation achieve the highest discrimination, with the highest disc value of 58, followed by oversegmentation (53) and deficient segmentation (49). The explanations are as follows: 1) the benchmark segmentation is obtained by manual annotation, which encodes high-level semantic understanding; thus, it is unavoidable that UFC [18] is less accurate than the benchmark segmentation and 2) in contrast to deficient segmentation, more regions are obtained in the oversegmentation setting, so it is rarer for one region to span several components, and hence fewer discriminative components are neglected. In addition to the segmentation, the iteration number of boosting, R, is also an important parameter in our categorization model. Here, we report the error rate under different iteration numbers of boosting. For comparison, we also report the error rate obtained by boosting the global image feature proposed in SPMK [10] (the second-best performer on Scene 15, as observed from Fig. 8). As shown in Fig. 12, for both the training and the test stages, boosting the graphlet-based weak SVM classifiers performs better than boosting the global image feature of SPMK.
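The err_t < 1 − 1/K requirement behind the log(K − 1) term of Algorithm 2 is easy to verify numerically. The sketch below is our own, for illustration only:

```python
import math

def classifier_weight(err, K):
    """a_t = log((1 - err)/err) + log(K - 1); positive iff err < 1 - 1/K."""
    return math.log((1 - err) / err) + math.log(K - 1)

# With K = 15 scene classes, a weak classifier only has to beat random
# guessing, i.e., its error must stay below 1 - 1/15 ≈ 0.933:
assert classifier_weight(0.90, 15) > 0   # still useful despite 90% error
assert classifier_weight(0.95, 15) < 0   # worse than random guessing
assert classifier_weight(0.45, 2) > 0    # binary case: need err < 1/2
```

This is why boosting many very weak topology classifiers remains effective on multiclass scene data.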


TABLE VII
CATEGORIZATION ACCURACIES UNDER DIFFERENT NUMBERS OF TRAINING IMAGES ON THE CALTECH-256

Fig. 11. Categorization accuracies under different numbers of k-means centers.

Furthermore, the performance becomes stable when the iteration number is larger than 300; thus, we set R to 300 in our implementation. Next, we evaluate the categorization performance of our approach under different training/test splits. As shown in Table VII, we present the categorization performance when different numbers of training images (i.e., 15, 30, 45, and 60) are employed on the Caltech-256 data set. In Fig. 11, we present the categorization accuracy when different numbers of k-means centers are applied on the PASCAL VOC 09. As can be seen, neither a small nor a large number of k-means centers is optimal; in our implementation, we use 32 k-means centers on this data set. Finally, the thresholds α and β are two important parameters of concern. In our implementation, we set α to {0.2, 0.4, 0.6, 0.8} to obtain a set of highly discriminative topologies, and then tune β to obtain different numbers of highly discriminative and low-redundant topologies for scene categorization. Categorization accuracies with different values of β are given in Fig. 13.

IV. CONCLUSION

Scene categorization is a hot research topic in computer vision and machine learning [4], [19], [31], [39], [41]. This paper presented a new scene categorization model focusing on mining representative topological visual patterns from scene images. An RAG is constructed to encode the structure of each scene image, and highly discriminative and low-redundant graphlets are then extracted from the RAGs. Thereafter, these selected high-quality graphlets are integrated into a boosting framework for scene categorization. Extensive experimental results on five popular image data sets demonstrate the effectiveness of the proposed method.

Fig. 12. Performance of the proposed method under different iteration numbers of boosting (TB: topological boosting; DB: directly boosting global features).

Fig. 13. Scene categorization performance under different values of β.

In the future, inspired by the human active viewing process, we plan an active-learning-based graphlet selection mechanism to obtain representative graphlets within an image. In addition, the proposed graphlet selection scheme is analogous to the receptive field learning algorithm [38], and we are trying to develop a model that automatically learns a topological receptive field.

REFERENCES

[1] S. Todorovic and N. Ahuja, "Region-based hierarchical image matching," Int. J. Comput. Vis., vol. 78, no. 1, pp. 47–66, Jun. 2008.
[2] M. F. Demirci, A. Shokoufandeh, Y. Keselman, L. Bretzner, and S. Dickinson, "Object recognition as many-to-many feature matching," Int. J. Comput. Vis., vol. 69, no. 2, pp. 203–222, Jul. 2006.
[3] P. F. Felzenszwalb and D. P. Huttenlocher, "Pictorial structures for object recognition," Int. J. Comput. Vis., vol. 61, no. 1, pp. 55–79, Jan. 2005.
[4] J. Li and D. Tao, "Simple exponential family PCA," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 3, pp. 485–497, Mar. 2013.
[5] X. Xu, I. W. Tsang, and D. Xu, "Soft margin multiple kernel learning," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, pp. 749–761, May 2013.


[6] Z. Harchaoui and F. Bach, “Image classification with segmentation graph kernels,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2007, pp. 1–8. [7] N. Shervashidze, S. V. N. Vishwanathan, T. H. Petri, K. Mehlhorn, and K. M. Borgwardt, “Efficient graphlet kernels for large graph comparison,” J. Mach. Learn. Res., vol. 5, pp. 488–495, Apr. 2009. [8] Y. Keselman and S. Dickinson, “Generic model abstraction from examples,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 7, pp. 1141–1156, Jul. 2005. [9] Y. J. Lee and K. Grauman, “Object-graphs for context-aware category discovery,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 1–8. [10] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2006, pp. 2169–2178. [11] E. Hadjidemetriou, M. D. Grossberg, and S. K. Nayar, “Multiresolution histograms and their use for recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 7, pp. 831–847, Jul. 2004. [12] D. Tao, X. Li, X. Wu, and S. J. Maybank, “Geometric mean for subspace selection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 260–274, Feb. 2009. [13] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Mar. 2004. [14] K. J. Cios, W. Pedrycz, and R. W. Swiniarski, Data Mining Methods for Knowledge Discovery. New York, NY, USA: Springer-Verlag, Aug. 1998. [15] G. Griffin, A. Holub, and P. Perona, “The Caltech 256,” Dept. Comput. Sci., California Inst. Technol., Pasadena, CA, USA, Tech. Rep., 2007. [16] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010. [17] B. Yao, X. Yang, and S.-C. 
Zhu, “Introduction to a large-scale general purpose ground truth database: Methodology, annotation tool and benchmarks,” in Proc. EMMCVPR, 2007, pp. 169–183. [18] X. Xiong and K. L. Chan, “Towards an unsupervised optimal fuzzy clustering algorithm for image database organization,” in Proc. IEEE Conf. Pattern Recognit., Sep. 2000, pp. 3909–3912. [19] X. Lu, Y. Wang, and Y. Yuan, “Sparse coding from a Bayesian perspective,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6, pp. 929–939, Apr. 2013. [20] Y. Han, W. Lu, and T. Chen, “Cluster consensus in discrete-time networks of multiagents with inter-cluster nonidentical inputs,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 4, pp. 566–578, Feb. 2013. [21] L. Jing, M. K. Ng, and T. Zeng, “Dictionary learning-based subspace structure identification in spectral clustering,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 8, pp. 1188–1199, Aug. 2013. [22] M. Kuramochi and G. Karypis, “An efficient algorithm for discovering frequent subgraphs,” IEEE Trans. Knowl. Data Eng., vol. 16, no. 9, pp. 1038–1051, Sep. 2004. [23] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combining classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998. [24] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1–8. [25] J. Zhu, H. Zou, S. Rosset, and T. Hastie, “Multi-class AdaBoost,” Statist. Inter., vol. 2, no. 3, pp. 349–360, 2009. [26] M. Stricker and M. Orengo, “Similarity of color images,” Proc. SPIE, vol. 2420, pp. 381–392, Mar. 1995. [27] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2005, pp. 886–893. [28] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Localityconstrained linear coding for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3360–3367. [29] M. 
Gori and S. Melacci, “Constraint verification with kernel machines,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, pp. 825–831, May 2013. [30] Y. Pang, Z. Ji, P. Jing, and X. Li, “Ranking graph embedding for learning to rerank,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 8, pp. 1292–1303, Jul. 2013. [31] G. Orchard, J. G. Martin, R. J. Vogelstein, and R. Etienne-Cummings, “Fast neuromimetic object recognition using FPGA outperforms GPU implementations,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 8, pp. 1239–1251, Aug. 2013. [32] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 2169–2178.


[33] X. Zhou, K. Yu, T. Zhang, and T. S. Huang, “Image classification using super-vector coding of local image descriptors,” in Proc. Eur. Conf. Comput. Vis., Sep. 2010, pp. 141–154. [34] H. Lee, A. Battle, R. Raina, and A. Y. Ng, “Efficient sparse coding algorithms,” in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2006. [35] L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei, “Object bank: A highlevel image representation for scene classification and semantic feature sparsification,” in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2010, pp. 1378–1386. [36] L. Zhang, Y. Han, Y. Yang, M. Song, S. Yan, and Q. Tian, “Discovering discriminative graphlets for aerial image categories recognition,” IEEE Trans. Image Process., vol. 22, no. 12, pp. 5071–5084, Dec. 2013. [37] A. K. Hartmann and M. Weigt, Phase Transitions in Combinatorial Optimization Problems: Basics, Algorithms and Statistical Mechanics. New York, NY, USA: Wiley, Sep. 2005. [38] Y. Jia, C. Huang, and T. Darrell, “Beyond spatial pyramids: Receptive field learning for pooled image features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3370–3377. [39] B. Zou, L. Li, Z. Xu, T. Luo, and Y. Y. Tang, “Generalization performance of Fisher linear discriminant based on Markov sampling,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 2, pp. 288–300, Feb. 2013. [40] L. Zhang, M. Song, Q. Zhao, X. Liu, J. Bu, and C. Chen, “Probabilistic graphlet transfer for photo cropping,” IEEE Trans. Image Process., vol. 22, no. 2, pp. 802–815, Feb. 2013. [41] N. Gkalelis, V. Mezaris, I. Kompatsiaris, and T. Stathaki, “Mixture subclass discriminant analysis link to restricted Gaussian model and other generalizations,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 1, pp. 8–21, Jan. 2012. [42] F. Nie, Z. Zeng, I. W. Tsang, D. Xu, and C. Zhang, “Spectral embedded clustering: A framework for in-sample and out-of-sample spectral clustering,” IEEE Trans. 
Neural Netw. Learn. Syst., vol. 22, no. 11, pp. 1796–1808, Nov. 2011. [43] O. Duchenne, A. Joulin, and J. Ponce, “A graph-matching kernel for object categorization,” in Proc. IEEE Conf. Comput. Vis., Nov. 2011, pp. 1–8. [44] Q. Wang, Z. Jiang, J. Yang, D. Zhao, and Z. Shi, “A hierarchical connection graph algorithm for gable-roof detection in aerial image,” IEEE Geosci. Remote Sens. Lett., vol. 8, no. 1, pp. 177–181, Jan. 2011. [45] J. Porway, K. Wang, and S. C. Zhu, “A hierarchical and contextual model for aerial image understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8. [46] L. Lin, X. Liu, S. Peng, H. Chao, Y. Wang, and B. Jiang, “Object categorization with sketch representation and generalized samples,” Pattern Recognit., vol. 45, no. 10, pp. 3648–3660, Oct. 2012. [47] L. Lin, T. Wu, J. Porway, and Z. Xu, “A stochastic graph grammar for compositional object representation and recognition,” Pattern Recognit., vol. 42, no. 7, pp. 1297–1307, Jul. 2009. [48] L. Zhang et al., “Spatial graphlet matching kernel for recognizing aerial image categories,” in Proc. IEEE Conf. Pattern Recognit., Nov. 2012, pp. 657–666. [49] J. Krapac, J. Verbeek, and F. Jurie, “Modeling spatial layout with fisher vectors for image categorization,” in Proc. IEEE Conf. Comput. Vis., Nov. 2011, pp. 1487–1494. [50] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3485–3492. [51] M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman, “Blocks that shout: Distinctive parts for scene classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 923–930. [52] M. Dixit, N. Rasiwasia, and N. Vasconcelos, “Adapted Gaussian models for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 937–943. [53] T. 
Kobayashi, “BFO meets HOG: Feature extraction based on histograms of oriented p.d.f. gradients for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 947–954. [54] Y. Zhang, Z. Jiang, and L. S. Davis, “Learning structured low-rank representations for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1–8. [55] O. Russakovsky, Y. Lin, K. Yu, and L. Fei-Fei, “Object-centric spatial pooling for image classification,” in Proc. Eur. Conf. Comput. Vis., Oct. 2012, pp. 1–15. [56] J. Wu and J. M. Rehg, “CENTRIST: A visual descriptor for scene categorization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1489–1501, Aug. 2011.


[57] D. Song and D. Tao, “Biologically inspired feature manifold for scene classification,” IEEE Trans. Image Process., vol. 19, no. 1, pp. 174–184, Jan. 2009. [58] Y. Huang, K. Huang, D. Tao, T. Tan, and X. Li, “Enhanced biologically inspired model for object recognition,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 6, pp. 1668–1680, Dec. 2011. [59] T. Zhou and D. Tao, “Double shrinking sparse dimension reduction,” IEEE Trans. Image Process., vol. 22, no. 1, pp. 244–257, Jan. 2013. [60] W. Liu and D. Tao, “Multiview Hessian regularization for image annotation,” IEEE Trans. Image Process., vol. 22, no. 7, pp. 2676–2687, Jul. 2013.

Luming Zhang (M’14) received the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China. He is currently a Post-Doctoral Research Fellow with the School of Computing, National University of Singapore, Singapore. His current research interests include multimedia analysis, image enhancement, and pattern recognition.

Rongrong Ji (SM’14) received the Ph.D. degree in computer science from the Harbin Institute of Technology, Harbin, China. He has been a Post-Doctoral Research Fellow with the Department of Electrical Engineering, Columbia University, New York, NY, USA, since 2011. He is currently a Full Professor with Xiamen University, Xiamen, China.

Yingjie Xia (M’12) received the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China. He was a Visiting Student with the University of Illinois at Urbana-Champaign, Champaign, IL, USA, from 2008 to 2009. He held a post-doctoral position with the Department of Automation, Shanghai Jiao Tong University, Shanghai, China, from 2010 to 2012. He is currently an Associate Professor with the Hangzhou Institute of Service Engineering, Hangzhou Normal University, Hangzhou. His current research interests include multimedia analysis, pattern recognition, and intelligent transportation systems.

Ying Zhang received the Ph.D. degree from the School of Computing, National University of Singapore, Singapore. She is currently a Research Assistant with the Centre of Social Media Innovations for Communities, National University of Singapore. Her current research interests include location-based service, sensor-rich media analytics, multimedia system, computer vision, and machine learning.

Xuelong Li (M’02–SM’07–F’12) is currently a Full Professor with the Center for Optical Imagery Analysis and Learning, State Key Laboratory of Transient Optics and Photonics, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an, China.