

Decomposition and Extraction: A New Framework for Visual Classification

Yuqiang Fang, Qiang Chen, Lin Sun, Bin Dai, and Shuicheng Yan, Senior Member, IEEE

Abstract— In this paper, we present a novel framework for visual classification based on hierarchical image decomposition and hybrid mid-level feature extraction. Unlike most mid-level feature learning methods, which focus on the process of coding or pooling, we emphasize that the mechanism of image composition also strongly influences the feature extraction. To effectively explore the image content for feature extraction, we model a multiplicity feature representation mechanism through a meaningful hierarchical image decomposition followed by a fusion step. In particular, we first propose a new hierarchical image decomposition approach in which each image is decomposed into a series of hierarchical semantic components, i.e., the structure and texture images. Different feature extraction schemes can then be adopted to match the decomposed structure and texture processes in a dissociative manner. Here, two schemes are explored to produce property related feature representations. One is based on a single-stage network over hand-crafted features and the other is based on a multi-stage network, which can learn features from raw pixels automatically. Finally, these multiple mid-level features are incorporated by solving a multiple kernel learning task. Extensive experiments are conducted on several challenging data sets for visual classification, and the experimental results demonstrate the effectiveness of the proposed method.

Index Terms— Image decomposition, visual classification, feature learning and sparse coding.

Manuscript received August 2, 2013; revised February 27, 2014; accepted May 27, 2014. Date of publication June 12, 2014; date of current version July 1, 2014. This work was supported by the National Natural Science Foundation of China under Grant 61375050 and Grant 91220301. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Adrian G. Bors. Y. Fang and B. Dai are with the College of Mechatronic Engineering and Automation, National University of Defense Technology, Changsha 410073, P. R. China (e-mail: [email protected]; [email protected]). Q. Chen and S. Yan are with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117583 (e-mail: [email protected]; [email protected]). L. Sun is with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2014.2330792

I. INTRODUCTION

A. Motivation

Generic visual category classification has been one of the most challenging tasks in computer vision. Although remarkable progress has been made in the past few years, the classification of complex semantic categories, such as scenes or objects, is still an open problem. The main difficulty comes from the variances in the appearance of the scenes/objects

caused by the variations of their scales, positions, viewpoints, etc. How to build an abstract and invariant representation (also called a signature or feature) for visual classification is the key task. In the past decade, a growing body of research on visual category classification has focused on building powerful representations with a coding and pooling pipeline inspired by the architecture of the mammalian primary visual cortex [1]. Usually, the global representations extracted from this pipeline are referred to as mid-level features, which are built upon low-level descriptors without using high-level image semantic information [2]. To date, considerable effort has been devoted to designing better mid-level features for visual classification under the basic coding and pooling architecture. Examples of popular mid-level features include bag-of-features [3], spatial pyramid matching [4]–[6], deep neural networks [7], [8] and HMAX nets [9], [10], etc. We refer the reader to [2] and [11] for more state-of-the-art mid-level feature methods. Most of these methods have achieved state-of-the-art performance on several challenging visual category recognition databases (e.g., 15-Scenes [3], Caltech-101 [12] and Caltech-256 [13]) and competitions (e.g., PASCAL VOC [14]).

Existing methods usually combine low-level features [15], [16] or design a sophisticated coding and pooling mechanism [17]–[20]. One important common characteristic of these works is that they extract many types of low-level features directly from the original images and then combine them together. However, we argue that this procedure is not ideal from the viewpoint of image modeling and visual perception. Firstly, focusing on the image itself, we find that the mechanism of image composition also strongly influences feature extraction. Specifically, the image patch is usually used as the basic unit in feature extraction, and based on its visual property, a patch can be described as a texture patch, a structure patch or a mixed patch. Ideally, patches with different visual properties should be described by property specific features, but this process is ignored in most methods, which instead apply a unified feature description to non-decomposed patches. Secondly, psychological evidence demonstrates that visual perceptual organization is not a monolithic entity but includes several different processes [21]. When human observers see an object, the visual stimuli are first processed in a dissociative manner into texture and structure information and then recombined in an integrative stage to recognize the object [21], [22]. Thus, it is desirable to design an object recognition system which can represent different visual cues such as structure




or texture in decomposed processing streams. Motivated by this, we propose a novel framework for visual classification based on hierarchical image decomposition and hybrid mid-level feature extraction. We utilize hierarchical image decomposition to decouple the structure and texture components into a series of property specific images, and then extract hybrid mid-level features from the decomposed images. Beyond the basic architecture, we address the following two issues in our method:

1) Representation via decomposition. In our method, we highlight the role of property specific image decomposition in the feature extraction pipeline. As shown in Fig. 1(a), both the material textures and the contour edges are contained in patch A. Thus, a single feature descriptor, e.g. SIFT [23], computed on patch A will mix the visual cues together. To eliminate this unwanted mixture, we suggest decomposing the image into two component images with different properties, namely a structure image and a texture image. Clearly, the feature representation benefits from the decomposition: as can be seen from Fig. 1(a), patch B and patch C from the decomposed images contain more property specific information. A property specific feature extraction process can then be performed based on the image decomposition.

2) Representation via hierarchical decomposition. Both texture and structure are scale-dependent concepts. For example, whatever is interpreted as texture at a given scale consists of significant structure when viewed at a finer scale (typically, the concept of "scale" is defined through a model-related parameter). This phenomenon can be explained as a perceptual transition [24]. As can be seen in Fig. 1(b), each pair of structure image and texture image (e.g. A1-B1, A2-B2, etc.) at a certain scale represents the original image in a particular respect. Hence, it is necessary to perform the decomposition at multiple scales. In our framework, we use a novel hierarchical image decomposition method to generate multi-scale structure and texture component images. As one can see in Fig. 1(b), the decomposed image pair in each layer corresponds to a specific scale. Thus, more decomposed visual cues are obtained from multiple scales, which facilitates the extraction and combination in our feature learning process.

Fig. 1. Illustration of the benefit of hierarchical image decomposition for feature learning. (a) Decomposed visual cues. (b) Decomposed visual cues at multiple scales.

B. Overview of the Proposed Approach

The proposed framework aims to extend the basic architecture of visual classification. Instead of learning features directly from the original images, we apply hierarchical image decomposition as preprocessing, and then dissociative extraction and adaptive integration are implemented in the later processing stages. An overview of the proposed framework is shown in Fig. 2. The proposed method consists of three stages. In the first stage, we decompose the input image into a series of structure images and texture images. This process is implemented with a novel hierarchical image decomposition algorithm, introduced in Section III. In the second stage, we compute mid-level features with respect to the decomposed images. Specifically, we employ two schemes, one based on hand-crafted features and the other emphasizing learning features from raw pixels, to explore the property specific mid-level features. The role of decomposition in both schemes is also analyzed. In the third stage, the features which carry different visual information are discriminatively combined with optimal weights for object recognition under the framework of Multiple Kernel Learning [25]. We refer to the features learned with the decomposition and extraction framework as hybrid mid-level features. It is worth noting that our contributions in this paper are in the same vein as the theories of visual information dissociation and integration in human visual perception [21]. The contributions of our paper include:
• We propose an improved hierarchical image decomposition method to separate the structure and texture components of an image at multiple scales. This step is essentially different from existing frameworks, and is important since it models the visual perception process as a multiplicity mechanism;
• We propose a new framework for visual classification based on decomposition and extraction. In this framework, we highlight the role of property specific decomposition in mid-level feature extraction and explore two efficient schemes to produce property related feature representations based on the decomposed images;
• To the best of our knowledge, our framework is the first work to study the decomposition and extraction mechanism in mid-level feature learning. The overall framework is evaluated through various experiments on



Fig. 2. Diagram of the proposed decomposition and extraction framework. The proposed framework consists of three stages. In the first stage, we decompose the input image into a series of structure images and texture images. In the second stage, we compute mid-level features with respect to the decomposed images. In the third stage, the features which contain different visual information are discriminatively combined with optimal weights for object recognition.

several public benchmarks. The results demonstrate the advantage of our framework.

The rest of the paper is organized as follows. In Section II, we discuss the related work. In Section III, we introduce a new hierarchical image decomposition method for generating structure and texture images at multiple scales. Section IV elaborates on the hybrid mid-level feature extraction process. In Section V, we present experimental results, which show that the proposed method achieves satisfactory performance in visual classification. We conclude this paper and discuss future work in Section VI.

Notation: In this paper, we define the Banach space of functions $L^p(\Omega) = \{u(x),\, x \in \Omega : \int_\Omega |u(x)|^p \, dx < \infty\}$, equipped with the norm $\|u\|_{L^p} = (\int_\Omega |u(x)|^p \, dx)^{1/p}$. Define the space of functions of bounded variation as $BV(\Omega) = \{u(x) \in L^1(\Omega) : \int_\Omega |Du| < \infty\}$, which is a Banach space endowed with the BV-norm $\|u\|_{BV} = \int_\Omega |u| \, dx + \int_\Omega |Du| \, dx$. Here $Du$ denotes the generalized derivative of the function $u$. For any vector x, its i-th component is denoted by x[i]. The trace of a matrix G is denoted by tr(G).

II. RELATED WORK

The importance of multiple visual cues has long been recognized by the computer vision community. To capture the rich visual information, many types of low-level features have been proposed to describe different aspects of the visual characteristics, e.g. texture descriptors such as Local Binary Patterns [26], and shape/edge descriptors such as the Scale-Invariant Feature Transform [23] and Histograms of Oriented Gradients [27]. It is often helpful, and sometimes necessary, to combine various features in order to gain a comprehensive understanding of an image. Thus, many visual feature combination approaches rooted in statistics and machine learning have been proposed, such as probabilistic modelling [28], multi-task learning [29], boosting [30] and multiple kernel learning [15], [16], [31]–[33].

Among the visual feature combination approaches for visual classification, kernel based approaches have attracted considerable attention. In kernel based methods, the kernels for different features are combined with predefined or learned weights via the Multiple Kernel Learning (MKL) technique. For instance, Gehler et al. [15] first formulated the feature combination problem by associating image features with kernel functions and proposed a new approach named LP-β to learn the weights with a boosting-like two-stage strategy. However, LP-β performs the combination based on classifier scores and implicitly assumes that feature dependency can be reflected by classifier dependency. Thus, many MKL methods emphasize estimating the optimal combination directly at the feature level. For example, the seminal work on MKL dates back to [31], where the authors formulated the problem as a semi-definite program. Varma et al. [16] reformulated the objective function of the SVM to jointly optimize the kernel weights and the classifiers. Orabona et al. [32] proposed the p-norm MKL algorithm by integrating group sparse regularization into the original MKL problem. Kumar et al. [33] formulated the MKL problem as a standard linear classification problem in a new instance space named "K-space". In essence, MKL selects features by minimizing the loss or error rates in a supervised scenario. Thus, kernel based approaches are powerful tools for feature combination from a purely discriminative view. However, how can multiple features be integrated from a generative view? Some related works adopt a generative representation which is adaptive to different visual components, including the structure part and the texture part. For example, Guo et al. [34] proposed a primal sketch model, of which the original definition was given by Marr [35], to integrate the structure part and texture part with image primitives and Markov random fields respectively. In [24], Zhu et al. observed that the compositional architecture of an image is fundamentally important for pattern modeling, and defined two types of subspaces for image representation, namely, explicit


manifolds for structure primitives and implicit manifolds for stochastic textures. Then, Si et al. [36] proposed a hybrid image template representation to integrate the implicit and explicit manifolds for object recognition. In [37], Shotton et al. proposed an efficient synergy of contour and texture cues for object recognition based on the boosting framework. They used contour fragments and texture-layout filters to describe the various cues, and then selected the best features by a cost-learning algorithm. Recently, Law et al. [38] also presented a novel hybrid image representation based on the coding-pooling pipeline. They emphasized that the extraction of low-level features differs with regard to their gradient magnitude, and designed an efficient merging scheme for low-contrast and high-contrast information in an extended bag-of-words pipeline. The work of Ma et al. [39] is the most similar to ours. They described a computational model named "recognition-through-decomposition-and-fusion". Specifically, in [39], a coupled conditional random field was developed to decompose the interactive processes of contour and texture. Then, spatial pyramid matching [4] was applied to both the contour and texture channels to extract features. However, they labeled each pixel as contour or texture with the trained conditional random field model rather than actually decomposing the image into two semantically different components. Thus, the visual information remains mixed in the feature extraction stage. Unlike previous methods, we decouple the structure and texture components by image decomposition, and then extract and fuse features from the decomposed images. To the best of our knowledge, our framework is the first to incorporate hierarchical image decomposition in mid-level feature extraction.

III. HIERARCHICAL IMAGE DECOMPOSITION

As aforementioned, our approach consists of three stages. In this section, we present the details of the first stage, i.e. the hierarchical image decomposition.

A. Image Structure-Texture Decomposition

Let an image be represented by a function f(x) : Ω → R^d on a domain Ω. In this paper, we only consider 2D open domains and gray images (typically, Ω = R² or (0, 1)², and d = 1). Thus, image structure-texture decomposition refers to splitting an image f into two components: the structure part u and the texture part v. Ideally, the structure part u contains the sharp edges or sketchable patterns, while the texture part v contains the oscillatory or non-sketchable patterns in f. A general decomposition framework can be summarized as

\inf_{(u,v)\in\mathcal{X}\times\mathcal{Y}} \big\{ E(f,\lambda;u,v) = F_1(u) + \lambda F_2(v) \;:\; f = u + v \big\},    (1)

where F_1(·) and F_2(·) are two functionals on appropriate spaces X and Y. Here, F_1(u) < ∞ and F_2(v) < ∞ if and only if (u, v) ∈ X × Y, and when choosing the two functionals for decomposition, F_1(u) ≪ F_1(v) and F_2(v) ≪ F_2(u). Most representative image structure-texture decomposition methods can be summarized as choices of X and Y, as well


as functionals F_1(·) and F_2(·). A typical choice for F_1(·) is the total variation ‖u‖_TV = ∫_Ω |Du| dx, a semi-norm of the BV-norm, where u ∈ BV. The total variation penalizes random oscillations in the signal while allowing piecewise smooth solutions, such that homogeneous regions with sharp boundaries are generated. The Rudin, Osher, and Fatemi (ROF) model [40] is a pioneering total variation based image decomposition model, which minimizes over u ∈ BV and v ∈ L². However, as pointed out in [41], the L²-norm cannot characterize the texture or oscillatory components v, which do not have small norms in L². To overcome this drawback, different norms have been proposed to replace ‖·‖_{L²}, e.g. the G-norm [41], the div(L^p)-norm [42], the H^{-1}-norm [43] and the nuclear (trace) norm [44]. Although many elegant texture norms or spaces have been proposed, most of them are difficult to handle numerically. In this paper, we adopt the TV-L¹ decomposition model. Given an image f(x) ∈ L¹, the TV-L¹ model is defined as the following minimization problem:

\inf_{(u,v)\in BV\times L^1} \Big\{ \int_\Omega |Du|\,dx + \lambda \|v\|_{L^1} \;:\; f = u + v \Big\}.    (2)

This type of energy function was proposed by S. Alliney [45] in a discrete setting for 1D signals. Later, Chan first analyzed the model theoretically in [46], exhibiting some very specific properties. Two of these properties make it suitable for our purpose in this paper:
• The TV-L1 model encourages an edge-preserving and uncorrupted decomposition. Like most edge-preserving decomposition methods (e.g., edge-preserving smoothing filters [47], [48] or anisotropic diffusion based methods [49]), the TV-L1 model keeps the sharp object edges while avoiding staircase effects. As described in [50], the TV-L1 model can be viewed as an unbiased case of anisotropic diffusion. This property is very important since most features in visual classification are edge-dependent.
• The TV-L1 model can also be considered a scale-dependent and intensity-independent image decomposition model. Compared with other methods that rely on the magnitudes of pixel differences, the TV-L1 model tends to decompose the image according to the scale of the image components and preserves the intensity in the decomposition. Thus, choosing an appropriate λ allows us to separate the structure and the texture components of an image according to their scales¹ (see Fig. 3). This property was proved in [51] by reformulating the TV-L1 model as an equivalent non-convex geometric problem. This scale-driven decomposition property has a strong relationship with the feature representation in a visual task; a practical application of this property can be found in face recognition [52].

¹The scale can be analytically defined by the G-value in [51].

Fig. 3. Illustration of the scale-dependent property of the TV-L1 decomposition. The decomposed structure images with different scale parameters λ are shown in the figure; the results show that λ controls the details retained in the structure image according to their geometric size rather than their intensity value.

B. Hierarchical TV-L1 Image Decomposition

In this subsection, we describe the proposed hierarchical TV-L1 image decomposition method. With a set of varying scale parameters (such as the parameter λ in Eq. 2), we can split the structure and the texture components at several scales. Indeed, the definitions of both texture and structure have a strong connection with scale. For example, whatever is interpreted as texture at a given scale consists of significant structure when viewed at a finer scale. This phenomenon was explained as a perceptual transition in [24]. Therefore, to better represent the structure and texture information at different perceptual scales, it is essential to extend mono-scale image decomposition to a multi-scale counterpart. Simply put, to obtain a K-scale decomposition, we could solve the TV-L1 model K times, each time with a different parameter λ. However, as suggested in [53] and [54], an iterative scheme is better suited to multi-scale image decomposition. Thus, following the idea of [54], we propose a hierarchical TV-L1 decomposition method with an iterative scheme (see Fig. 4). The hierarchical decomposition method then builds a one-parameter family of paired images for the feature extraction.

Fig. 4. The diagram of the hierarchical image decomposition. At each layer, TV-L1 is applied again to the previous texture image; thus, the decomposition follows an iterative scheme.

More specifically, we would like to construct a K-scale decomposition of the input image f ∈ L¹(Ω) for a given set of parameters λ = {λ₁, λ₂, . . . , λ_K}, where λ₁ < λ₂ < · · · < λ_K. Without loss of generality, let E(f, λ, BV, L¹) denote the objective function of the TV-L¹ model in Eq. 2. We start our hierarchical decomposition at the initial scale λ = λ₁:

\inf_{(u_{\lambda_1},\, v_{\lambda_1})} \{ E(f, \lambda_1, BV, L^1) : f = u_{\lambda_1} + v_{\lambda_1} \}.    (3)

Usually, the initial λ₁ is kept small, so v_{λ₁} retains many details and is close to f. After that, we repeat the decomposition on the previous texture image step by step:

\inf_{(u_{\lambda_{k+1}},\, v_{\lambda_{k+1}})} \{ E(v_{\lambda_k}, \lambda_{k+1}, BV, L^1) : v_{\lambda_k} = u_{\lambda_{k+1}} + v_{\lambda_{k+1}} \}.    (4)

After k steps, we obtain the following hierarchical representation of f:

f \approx u_{\lambda_1} + u_{\lambda_2} + u_{\lambda_3} + \cdots + u_{\lambda_k} + v_{\lambda_k}.    (5)

As one can see, the partial sum f_{\lambda_k} = \sum_{i=1}^{k} u_{\lambda_i} provides a multi-layer description of f in a scale of spaces intermediate between BV and L¹. Thus, we obtain two sets of images, namely, structure images F and texture images V:

\mathcal{F} = \{ f_{\lambda_1}, f_{\lambda_2}, \ldots, f_{\lambda_k} \}, \qquad \mathcal{V} = \{ v_{\lambda_1}, v_{\lambda_2}, \ldots, v_{\lambda_k} \},    (6)

and each pair of images satisfies f = f_{\lambda_i} + v_{\lambda_i}.

C. Numerical Solution With Adaptive PDHG

In this subsection, we provide the details of the numerical algorithm for the hierarchical TV-L¹ decomposition. First of all, we redefine the images f, u, v ∈ R^{M×N} as concatenated vectors in the discrete setting. The discrete gradient is the map ∇ : R^{M×N} → (R^{M×N})², implemented with forward differences and the Neumann boundary condition

(\nabla u)_{i,j} = \begin{pmatrix} (\nabla u)^1_{i,j} \\ (\nabla u)^2_{i,j} \end{pmatrix},    (7)

where

(\nabla u)^1_{i,j} = u_{i+1,j} - u_{i,j}, \;\; i < M; \qquad (\nabla u)^1_{M,j} = 0,
(\nabla u)^2_{i,j} = u_{i,j+1} - u_{i,j}, \;\; j < N; \qquad (\nabla u)^2_{i,N} = 0.    (8)
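As a concrete illustration of Eq. 8, the following is a minimal NumPy sketch of the forward-difference gradient with the Neumann boundary condition, together with the isotropic discrete total variation it induces (used in the model below). The function names are ours and are not part of the paper:

```python
import numpy as np

def grad(u):
    """Forward-difference gradient with Neumann boundary (Eq. 8).
    u: (M, N) array. Returns an (M, N, 2) array:
    [..., 0] differences along axis 0 (zero on the last row),
    [..., 1] differences along axis 1 (zero on the last column)."""
    g = np.zeros(u.shape + (2,))
    g[:-1, :, 0] = u[1:, :] - u[:-1, :]
    g[:, :-1, 1] = u[:, 1:] - u[:, :-1]
    return g

def tv(u):
    """Isotropic discrete TV seminorm: sum over pixels of |(grad u)_{i,j}|_2."""
    g = grad(u)
    return np.sqrt((g ** 2).sum(axis=-1)).sum()

# Toy usage: the TV value of a random image.
u = np.random.rand(64, 64)
print(tv(u))
```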

Thus, the discrete TV-L1 model is given by

\min_{u} \; \|\nabla u\|_1 + \lambda \|u - f\|_1,    (9)

where the discrete TV seminorm can be written in terms of ‖·‖₁ as \|u\|_{TV} = \|\nabla u\|_1 = \sum_{i=1}^{M}\sum_{j=1}^{N} \|(\nabla u)_{i,j}\|_2 with \|(\nabla u)_{i,j}\|_2 = \sqrt{((\nabla u)^1_{i,j})^2 + ((\nabla u)^2_{i,j})^2}. Although solving the above problem is challenging due to the non-differentiability of the ℓ₁ norm at zero, many numerical methods are available. In our method, we use a first-order primal-dual hybrid gradient (PDHG) algorithm to solve the TV-L¹ model. PDHG was proposed in [55] to efficiently solve a large family of non-smooth convex optimization problems with linear convergence. In order to apply the PDHG algorithm to Eq. 9, we rewrite it as a saddle-point, or primal-dual, problem:

\min_{u}\max_{\xi} \; \langle \nabla u, \xi \rangle - \delta_{\{\|\cdot\|_2 \le 1\}}(\xi) + \lambda\|u - f\|_1,    (10)

where ξ ∈ (R^{M×N})² is the dual variable and δ_{{‖·‖₂≤1}}(·) is the indicator function of the unit ball. The PDHG algorithm can be interpreted as a forward-backward scheme which alternates a gradient ascent step in the dual variable ξ and a gradient descent step in the primal variable u. Thus, applying PDHG to Eq. 10, the iterations take the form

u_{t+1} = \mathrm{prox}_{\tau_t}(u_t - \tau_t\, \mathrm{div}\,\xi_t), \qquad \xi_{t+1} = \mathrm{prox}_{\sigma_t}(\xi_t + \sigma_t \nabla(2u_{t+1} - u_t)),    (11)



Algorithm 1: Hierarchical TV-L1 Image Decomposition
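The pseudocode box of Algorithm 1 is not reproduced here. Below is a minimal NumPy sketch, written by us under simplifying assumptions: fixed PDHG step sizes with τσ‖∇‖² ≤ 1 (‖∇‖² ≤ 8 for this discretization) instead of the adaptive residual-balancing rule of [56], a fixed iteration budget instead of a convergence test, and the scale schedule λ_k = 2^{k-1}σ reported later in the experiments. All function names are ours, not the paper's:

```python
import numpy as np

def grad(u):
    # Forward differences with Neumann boundary (Eq. 8).
    g = np.zeros(u.shape + (2,))
    g[:-1, :, 0] = u[1:, :] - u[:-1, :]
    g[:, :-1, 1] = u[:, 1:] - u[:, :-1]
    return g

def div(xi):
    # Negative adjoint of grad: backward differences with Dirichlet boundary,
    # so that <grad u, xi> = <u, -div xi>.
    d = np.zeros(xi.shape[:2])
    d[:-1, :] += xi[:-1, :, 0]
    d[1:, :]  -= xi[:-1, :, 0]
    d[:, :-1] += xi[:, :-1, 1]
    d[:, 1:]  -= xi[:, :-1, 1]
    return d

def tv_l1_pdhg(f, lam, n_iter=300, tau=0.2, sigma=0.5):
    # Fixed-step PDHG for min_u ||grad u||_1 + lam * ||u - f||_1 (Eqs. 9-12).
    f = np.asarray(f, dtype=float)
    u = f.copy()
    xi = np.zeros(f.shape + (2,))
    for _ in range(n_iter):
        u_old = u
        # Primal step: gradient step on <grad u, xi> = <u, -div xi>,
        # then the prox of tau*lam*||. - f||_1, i.e. soft-thresholding towards f.
        w = u - tau * (-div(xi))
        u = f + np.sign(w - f) * np.maximum(np.abs(w - f) - tau * lam, 0.0)
        # Dual step: prox of the unit-ball indicator is a pointwise projection.
        p = xi + sigma * grad(2 * u - u_old)
        norm = np.maximum(1.0, np.sqrt((p ** 2).sum(axis=-1, keepdims=True)))
        xi = p / norm
    return u, f - u                        # structure part u, texture part v = f - u

def hierarchical_tv_l1(f, K=6, sigma0=0.1):
    # Iterative hierarchical decomposition (Eqs. 3-6): each layer re-decomposes
    # the previous texture image.
    f = np.asarray(f, dtype=float)
    structures, textures = [], []
    residual = f
    u_sum = np.zeros_like(f)
    for k in range(K):
        lam = (2 ** k) * sigma0            # lambda_k = 2^(k-1)*sigma (loop index is 0-based)
        u, residual = tv_l1_pdhg(residual, lam)
        u_sum = u_sum + u
        structures.append(u_sum.copy())    # f_{lambda_k} = sum_i u_{lambda_i}
        textures.append(residual.copy())   # v_{lambda_k}, with f = f_{lambda_k} + v_{lambda_k}
    return structures, textures
```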

Fig. 5. Some results obtained by the proposed hierarchical image decomposition. The images are selected from the Caltech-101 and 15-Scenes databases.

where the discrete divergence is the map div : (R^{M×N})² → R^{M×N}, implemented with backward differences and the Dirichlet boundary condition. Specifically, the proximal operators are

\mathrm{prox}_{\tau_t}(\hat{u}) = \arg\min_{u} \; \frac{1}{2\tau_t}\|u - \hat{u}\|_2^2 + \lambda\|u - f\|_1, \qquad \mathrm{prox}_{\sigma_t}(\hat{\xi}) = \arg\min_{\xi} \; \frac{1}{2\sigma_t}\|\xi - \hat{\xi}\|_2^2 + \delta_{\{\|\cdot\|_2 \le 1\}}(\xi),    (12)

where the parameters τt and σt are the step sizes of the primal and dual updates, respectively. Indeed, the speed of the original PDHG algorithm relies heavily on these step sizes, and a practical adaptive scheme converges faster than a constant-step-size scheme. Thus, we use the adaptive scheme proposed in [56] to automatically tune the step sizes in Eq. 11 under the residual balancing principle. A more comprehensive analysis of the convergence guarantees can be found in [55] and [56]. For convenience, we list the numerical details of our hierarchical decomposition with the adaptive primal-dual TV-L1 solver in Algorithm 1. Fig. 5 shows some image decomposition results obtained with Algorithm 1. As can be seen, each image is decomposed into two sets of property specific images. The hierarchical decomposition efficiently extends TV-L1, so that multi-scale visual cues are obtained.

D. Relationship With Previous Work

Our hierarchical TV-L1 image decomposition method provides a multi-scale representation consisting of a one-parameter family of decomposed images, the parameter indicating the degree (scale) of the decomposition. Although our method is most closely related to [54], there are two important differences: 1) The employed image decomposition models are different. Their work repeatedly applies the ROF (i.e. TV-L2) model for the image decomposition at each layer, while our method adopts the TV-L1 model instead. Given its edge-preserving and scale-dependent characteristics, we argue that the TV-L1 model is better suited as a preprocessing step for hybrid feature extraction. 2) The numerical algorithms used for the hierarchical decomposition are also different. Their work uses the Gauss-Seidel iterative scheme to solve the Euler-Lagrange equation associated with the ROF model, while we propose a more efficient adaptive PDHG method for solving the TV-L1 model at each layer.

Meanwhile, we note that several hierarchical decomposition methods developed in the framework of mathematical morphology (e.g., Differential Morphological Profiles (DMP) [57], Differential Area Profiles (DAP) [58] and Constrained Connectivity [59]) are also similar to our proposed method. For example, in the frameworks of DMP and DAP, the image content can also be decomposed hierarchically based on scales (i.e. the value of a given descriptor, a size, or a statistically derived metric), giving a



multi-scale representation of fine to coarse image details. Morphological hierarchical representations can therefore often be characterised by the properties of the morphological operators used (or of the granulometry [58]). In our method, by contrast, a non-linear multi-scale representation is generated via a set of TV-L1 operators.

IV. HYBRID MID-LEVEL FEATURE EXTRACTION

In this section, we describe how to extract the hybrid mid-level features and perform classification. As described in the previous section, we have already obtained the decomposed components, from which we can extract better features, named the hybrid mid-level features. In the extraction we utilize and extend the sparse coding-pooling pipeline. Specifically, we explore two schemes to produce the hybrid mid-level features. One is based on a single-stage network over hand-crafted feature descriptors and the other is based on a multi-stage network which can learn features from the pixel level automatically. The features extracted under our framework, which includes image decomposition and property specific extraction, show a significant advantage.

A. Hybrid Mid-Level Feature Extraction on Hand-Crafted Features

In the computer vision community, there has been a large amount of work on extracting hand-crafted features for image classification or matching. In general, these features are specifically designed for describing structure or texture cues. For example, LBP features [26] are widely used for describing texture cues. In contrast, structure or sketch cues tend to be better described by HOG [27], shape context descriptors [60], and SIFT [23], etc. Recently, many researchers have considered using these local hand-crafted features in a global representation for better recognition [2], [4], [5]. In most of these methods, a single-stage network architecture is built by applying sparse coding on top of hand-crafted SIFT features to achieve good performance. In this subsection, based on the image decomposition, we extend the single-stage network framework to a new hybrid counterpart.

More specifically, for each structure image f_{λ_k} at scale λ_k, large scale SIFT descriptors x_i^{c_k} (e.g. typically 24 × 24 pixels) are densely extracted at N locations identified by their indices i = 1, 2, . . . , N from f_{λ_k}. To capture the texture details, we employ small scale SIFT and LBP descriptors (e.g. 12 × 12 pixels) as texture descriptors x_i^{t_k}, which are densely extracted from the texture images v_{λ_k}. After that, with the set of local descriptors, we learn two sets of dictionaries, i.e. a structure dictionary D^{c_k} and a texture dictionary D^{t_k} at each scale, by optimizing the following cost function:

R(\tilde{D}, \tilde{x}, \tilde{\alpha}) = \min_{\tilde{D}} \; \frac{1}{n}\sum_{i}^{n} \|\tilde{x}_i - \tilde{D}\tilde{\alpha}_i\|_2^2 + \gamma\|\tilde{\alpha}_i\|_1, \quad \text{subject to } \|D(:,p)\|_2 \le 1 \;\; \forall p \in \{1, 2, \ldots, P\},    (13)

where D(:, p) is the p-th column of D and γ is a parameter controlling the sparsity penalty. The cost functions R(D^{c_k}, x^{c_k}, α^k) and R(D^{t_k}, x^{t_k}, β^k) correspond to training the structure and texture dictionaries, respectively. With the trained dictionaries, the coding step is performed at each location by solving an ℓ1-regularized least squares problem to obtain the structure codes α_i^k and texture codes β_i^k, respectively. Finally, spatial pyramid max pooling is used to generate the final features. The features of each spatial cell are the max pooled codes, which are simply the component-wise maxima over all sparse codes within the cell at each scale. Denote the spatial cells by S_m, m = 1, 2, . . . , M, where M is the number of cells. The max pooled codes are h_m^{c_k}[·] = max_{i∈S_m} |α_i^k[·]| and h_m^{t_k}[·] = max_{i∈S_m} |β_i^k[·]|, and the final signatures are g^{c_k} = [h_1^{c_k}, h_2^{c_k}, . . . , h_M^{c_k}] and g^{t_k} = [h_1^{t_k}, h_2^{t_k}, . . . , h_M^{t_k}].

Fig. 6. Illustration of the advantage of the proposed framework on feature extraction. We densely compute SIFT at two scales (with patch sizes of 12 × 12 and 24 × 24 pixels) on the original image, the structure image and the texture image respectively. The SIFT image is visualized in the same way as in [61].

Compared with existing single-stage networks, we use the decomposed images, extending the original coding-pooling pipeline to a pipeline with a pair of channels. This simple extension based on image decomposition significantly benefits the mid-level feature extraction. Specifically, under the original framework [5], hand-crafted features are extracted from the raw image patch, and thus the mixed components within the patch are described in a mixed manner. In general, a better representation requires that patches with different properties be described with suitable features. As shown in Fig. 6, we visualize the SIFT image (per-pixel SIFT) with the method used in [61]; pixels with different colors reflect different structures. Note that patch A contains both structure and texture information, so the SIFT descriptor is unable to accurately describe the basic property of the patch. As shown in Fig. 6, the color is mixed in both patch A1 and patch A2. However, when we extract SIFT from the structure and texture images respectively, we obtain a clearer description of both properties: patch B2 contains a consistent color, revealing the sharp contour, while patch C1 contains more complex patterns, denoting the texture. Thus, the decomposition provides a flexible choice



Fig. 7. Diagram of the hybrid multi-stage sparse coding pipeline. For the structure channel, we choose large patches (10 × 10 pixels in the first stage and 9 × 9 pixels in the second stage) to learn the features. For the texture channel, small patches (5 × 5 and 4 × 4 pixels) are used.

that we can apply a suitable descriptor, such as SIFT at different scales or LBP, to the suitable component image, structure or texture.

B. Hybrid Mid-Level Feature Extraction on Learned Features

Interestingly, more recent mid-level feature learning methods move toward the pixel level [17], [62], learning a hierarchy of features based on raw local patches to replace the hand-crafted features. The learned features are attractive due to their adaptivity and scalability. In this subsection, we also implement a scheme to learn hybrid features from raw patches with a multi-stage sparse coding pipeline. Concretely, we explore two different two-stage channels for the structure and texture images respectively, as shown in Fig. 7. First, in both channels, we train the dictionaries D^{c_k} and D^{t_k} by K-SVD [63], which is a highly effective method for training overcomplete dictionaries for sparse representation at the pixel level. Then batch orthogonal matching pursuit [63] is used to compute the sparse codes α_i^k and β_i^k of the structure and texture patches (the patches have a specific size for the structure and texture images respectively) in each stage. After that, max pooling is applied to aggregate the sparse codes as follows:

h_m^{c_k}[\cdot] = \max_{i \in S_m}\big[\max(\alpha_i^k[\cdot], 0), \; \max(-\alpha_i^k[\cdot], 0)\big].    (14)
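A compact NumPy sketch of this sign-split spatial max pooling (Eq. 14) could look as follows. The sparse codes, their cell assignments and the number of pyramid cells are assumed to be given (e.g. codes produced by OMP against a K-SVD dictionary); the names are illustrative only:

```python
import numpy as np

def sign_split_max_pool(codes, cell_ids, n_cells):
    """Sign-split max pooling over spatial cells (Eq. 14).
    codes:    (n_patches, dict_size) sparse codes of one channel.
    cell_ids: (n_patches,) index of the spatial cell each patch falls into.
    Returns an (n_cells, 2 * dict_size) array: positive and negative
    responses pooled separately, as in [64]."""
    dict_size = codes.shape[1]
    pooled = np.zeros((n_cells, 2 * dict_size))
    for m in range(n_cells):
        in_cell = codes[cell_ids == m]
        if in_cell.size == 0:
            continue
        pooled[m, :dict_size] = np.maximum(in_cell, 0).max(axis=0)   # positive part
        pooled[m, dict_size:] = np.maximum(-in_cell, 0).max(axis=0)  # negative part
    return pooled

# The global signature g^{c_k} concatenates the pooled vectors of all pyramid cells.
codes = np.random.randn(500, 256)               # toy sparse codes
cell_ids = np.random.randint(0, 21, size=500)   # 1x1 + 2x2 + 4x4 pyramid -> 21 cells
g = sign_split_max_pool(codes, cell_ids, 21).ravel()
```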

The texture feature h_m^{t_k} can be obtained in a similar manner. In particular, we split the positive and negative components of the codes separately, following [64]; the positive and negative responses are thus weighted respectively to improve the discriminative power. Finally, we use spatial pyramid max pooling [4] in the last stage to generate the global representation by concatenating the pooled features from the different spatial cells. The mid-level features are thus also denoted as g^{c_k} = [h_1^{c_k}, h_2^{c_k}, . . . , h_M^{c_k}] and g^{t_k} = [h_1^{t_k}, h_2^{t_k}, . . . , h_M^{t_k}]. In Fig. 8 we visualize the first-stage dictionaries learned from the original, structure and texture images by K-SVD. Note that they differ from one another. The dictionary learned from the original image contains many mixed atoms. In contrast, the atoms in the structure dictionary have large pieces of constant value with clear boundaries, and the atoms

Fig. 8. Examples of learned dictionaries from different component images. All dictionaries are learned by K-SVD with a fixed size of 256.

of the texture dictionary capture complex patterns. Thus, in our method, the raw visual information is decomposed into separate channels to learn a more meaningful representation. Each channel captures a distinct perceptual aspect of objects, and can be used to build a hybrid multi-stage feature.

C. Feature Combination and Classification

In this subsection, we introduce how to incorporate the multiple appearance cues obtained from the image decomposition to learn the optimal representation for object classification. As mentioned above, assuming we define K scales for the image decomposition, we obtain L = 2K mid-level features {g^{c_1}, . . . , g^{c_K}, g^{t_1}, . . . , g^{t_K}} for each instance. For notational convenience, we use g_i = [g_i^1, . . . , g_i^L] to denote the feature of the i-th instance f_i with label y_i ∈ {1, 2, . . . , C}, where C is the number of classes and i = 1, . . . , n. Ideally, the feature combination could be implicitly completed by a linear classifier on the concatenated feature g_i. However, the concatenated feature has a large dimension, which makes the computation intractable for a linear classifier. In our method, we therefore transform the feature combination problem into a kernel combination task (in this paper we only consider linear kernels). More specifically, we focus on the convex combination problem

G = \sum_{\ell=1}^{L} \theta[\ell]\, G_\ell, \qquad \mathbf{1}^\top\theta = 1, \quad \theta \succeq 0,    (15)

where G_ℓ ∈ R^{n×n} is the Gram matrix computed from all the samples with the ℓ-th feature g^ℓ. In general, the feature g^ℓ corresponds to a certain perceptual aspect, reflecting the structure or texture information at a given scale. The feature combination thus reduces to the determination of the optimal weights θ. In our method, we employ the algorithm of [25],



Fig. 9. Examples of decomposed images with different parameters k. Both images are chosen from the Caltech-256 dataset. We can see that the decomposed images efficiently reflect the structure and texture cues at multiple scales when 1 ≤ k ≤ 6. However, when k > 6, the structure images approximate the original images and the decomposition becomes meaningless.

an efficient and parameter-free multi-class multiple kernel learning algorithm, to learn the optimal weights. Specifically, we formulate the learning problem as a semi-infinite linear program (SILP):

\max_{\tilde{\theta}} \; \eta \quad \text{subject to} \quad \tilde{\theta} \succeq 0, \quad \tilde{\theta}^\top b = 1, \quad \sum_{\ell=0}^{L}\sum_{j=1}^{C} \tilde{\theta}[\ell]\, S_\ell(a_j) \ge \eta \;\; \forall a_j \in \mathbb{R}^n,    (16)

where

S_\ell(a_j) = \frac{1}{4}\, a_j^\top \tilde{G}_\ell\, a_j - b[\ell]\, a_j^\top z_j,    (17)

z_j[i] = \begin{cases} \dfrac{n - n_i}{n_i\, n} & \text{if } y_i = j, \\ -\dfrac{n_i}{n} & \text{otherwise}, \end{cases}    (18)

and b[\ell] = \mathrm{tr}(\tilde{G}_\ell), \tilde{G}_\ell = P G_\ell P for \ell = 1 : L, \tilde{G}_0 = I, and P = I - \frac{1}{n} e_n e_n^\top is the centering matrix. Note that \tilde{\theta} and \eta are linearly constrained in Eq. 16, but with an infinite number of constraints. To solve the problem, the column generation technique [65] is used to alternately optimize \tilde{\theta} and a_j. In each iteration, a_j is obtained by solving the linear system

\Big( \frac{1}{2} I + \frac{1}{2\tilde{\theta}[0]} \sum_{\ell=1}^{L} \tilde{\theta}[\ell]\, \tilde{G}_\ell \Big) a_j = z_j.    (19)

With a_j fixed, Eq. 16 becomes a linear programming problem with finitely many constraints, which can be solved efficiently. Indeed, \tilde{\theta}[0] is the regularization parameter, which is jointly optimized in Eq. 16. With the optimal \tilde{\theta}^*, we obtain the optimal feature combination as G^* = \sum_{\ell=1}^{L} \tilde{\theta}^*[\ell]\, G_\ell. The weighted feature Gram matrix G^* is then used in the dual SVM solver to train the classifiers.
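As an illustration of how the learned weights are consumed downstream, the sketch below combines precomputed linear Gram matrices as in Eq. 15 and trains an SVM on the combined kernel with scikit-learn. The weights theta are assumed to be given (e.g. the output of the SILP above, or simply uniform as a placeholder), and scikit-learn's SVC is used purely for illustration (the paper itself trains one-vs-all classifiers with LIBSVM); all variable names are ours:

```python
import numpy as np
from sklearn.svm import SVC

def combine_kernels(features, theta):
    """Convex combination of linear Gram matrices (Eq. 15).
    features: list of L arrays, each (n_samples, d_l), one per decomposed channel/scale.
    theta:    (L,) non-negative weights summing to one."""
    grams = [X @ X.T for X in features]     # linear kernels G_l
    return sum(w * G for w, G in zip(theta, grams))

# Toy example with L = 4 mid-level features for n = 60 samples.
rng = np.random.default_rng(0)
features = [rng.standard_normal((60, 128)) for _ in range(4)]
labels = rng.integers(0, 3, size=60)
theta = np.full(4, 1.0 / 4)                 # placeholder: uniform weights

G_star = combine_kernels(features, theta)
clf = SVC(kernel="precomputed", C=10.0)     # the paper fixes the regularization parameter to 10
clf.fit(G_star, labels)

# At test time, the cross Gram matrix between test and training samples is combined
# with the same weights before calling clf.predict (here we simply reuse the training kernel).
print(clf.predict(G_star)[:5])
```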

V. EXPERIMENTS

In this section, we evaluate the effectiveness of our proposed method with comprehensive experiments on three datasets, namely, the 15-Scenes dataset [3], the Caltech-101 dataset [12], and the Caltech-256 dataset [13]. We first briefly introduce the three datasets and give our experimental setup. Then we conduct a series of experiments on these datasets and analyze the experimental results respectively.

A. Experimental Datasets

The 15-Scenes dataset [3] is a widely-used database for scene classification. It contains 15 categories and 4,485 images in total. Each category contains 200 to 400 images. The Caltech-101 dataset [12] is a popular benchmark dataset for object categorization. It consists of 9,144 images from 101 categories and 1 background category. The number of images per category varies from 31 to 800. Compared with the above two datasets, the Caltech-256 dataset is more challenging. It contains 256 categories and 29,780 images besides a background category. Each category contains at least 80 images.

B. Experimental Setup

1) Image Decomposition: In Algorithm 1, there are two essential parameters for the image decomposition: 1) the scale parameters λ_k, k = 1, . . . , K; and 2) the number of scales K. The parameter λ_k is responsible for the balance between the structure and texture penalties; increasing the value of λ_k causes the structure image to approximate the original image, as shown in Fig. 9. In our experiments, we set λ_k = 2^{k-1}σ according to [54]. The parameter σ decides the variation between two successive scales; a small σ indicates a relatively slow transition, and vice versa. In practice, we find σ = 0.1 is a good default setting. As for the choice of K, we experimentally test the approximation error ‖f_{λk} − f‖₂ between the structure image and the original image over a wide range of k. As shown in Fig. 10, we decompose 10 images randomly chosen from the


Fig. 10. The variation of the approximation error ‖f_{λk} − f‖₂ as k varies.

Fig. 12. Convergence rate comparison for the TV-L1 decomposition with the non-adaptive/adaptive PDHG method. (a) The example image and decomposed results. (b) The convergence curves with λ = 0.25.

TABLE I: FEATURE EXTRACTION TIME ON A TYPICAL 300 × 300 IMAGE

TABLE II: CLASSIFICATION ACCURACY COMPARISON ON 15-SCENES WITH 100 TRAINING EXAMPLES

Fig. 11. The architectures of mid-level feature extraction. (a) The baseline architectures and ours. (b) The schemes of feature extraction.

15-Scenes dataset and Caltech-256 dataset in order to show the variation of the approximation error as k varies. In fact, as the value of k increases, more details are retained in the structure image and the approximation error gradually decreases. According to Fig. 10, we find the error tends to zero when k > 6 for most images. Thus it is reasonable to choose K = 6 empirically. Fig. 9 also shows two examples of decomposed images with varying k. As one can see, the decomposed images efficiently reflect the structure and texture cues when 1 ≤ k ≤ 6. Thus, throughout our experiments, we fix K = 6 and σ = 0.1 unless otherwise stated.

2) Baseline Architectures: To comprehensively evaluate the proposed framework, we analyze the existing mid-level feature extraction methods and find that their architectures can be divided into three basic types, as described below. We take these three architectures as the baselines in our experiments (see Fig. 11(a)).

Arch. 1 (single feature + mono-scale): Typically, in this architecture, a single low-level feature with a mono-scale setting is extracted in a coding-pooling pipeline [4], [5].

Arch. 2 (single feature + multi-scale): It has been acknowledged that using multi-scale features increases the number of low-level cues for generating the mid-level features, and thus favorably impacts performance. This architecture, with a single feature and a multi-scale setting, is therefore widely used for classification [6], [10], [11], [38], [64].

Arch. 3 (multiple features + multi-scale): In this architecture, diverse features are extracted at multiple scales. As a result, the mid-level features can be significantly improved by considering complementary low-level descriptors of different aspects such as shape or texture [15], [16], [66].

Beyond the above basic architectures, we integrate decomposition and extraction in our proposed framework, and obtain two new architectures, i.e. Arch. 4 and Arch. 5 in Fig. 11(a). In these two architectures, multiple features with diverse types or scales are separately extracted from the decomposed images. For a clear exposition, we outline the details of the feature extraction in Fig. 11(b). Note that:



Fig. 13. Accuracy comparison with different baseline architectures on 15-Scenes. (a) and (b) are the comparisons with the hand-crafted feature extraction scheme. (c) is the comparison with the learned feature extraction scheme.

1) In the extraction scheme on hand-crafted features, we argue that small scale SIFT and LBP are suitable for describing texture cues while large scale SIFT is suitable for describing structure cues. Thus, comparing Arch. 2 and Arch. 4, features at different scales should be extracted from the structure and texture images respectively. In Arch. 5, we further extract LBP and small scale SIFT features from the texture images to improve the representation of the texture cues. 2) In the extraction scheme on learned features, we hold that patches of different sizes should be extracted from the structure and texture images respectively to learn the property specific features. Specifically, we use large patches (10×10 pixels in the first stage and 9×9 pixels in the second stage) to learn the structure dictionary, because a sketchable or structural pattern is usually contained in a large region. Texture details, however, can be viewed as a composition of many smaller structures, so it is reasonable to describe texture with small patches (5 × 5 pixels in the first stage and 4 × 4 pixels in the second stage). Meanwhile, Arch. 4 and Arch. 5 use the same features as the baseline methods Arch. 2 and Arch. 3 for a fair comparison. The only difference is that we extract property specific features from the decomposed images instead of the original image.

In all experiments, we resize the images to be no larger than 300 × 300 pixels with preserved aspect ratio. To be consistent with previous works, we use a dense grid sampling strategy (with a step of 4 pixels) to extract the low-level features, which include the hand-crafted features and the learned features. For efficient computation, we set the size of the dictionary to 1024 in the hand-crafted feature scheme, and to 300 (first stage) and 1024 (second stage) in the learned feature scheme. Furthermore, spatial pyramid matching with levels of [1 × 1, 2 × 2, 4 × 4] is performed to generate the final signature, and power normalization [67] is applied before classification.

3) Classifier Training and Evaluation: We conduct 3 random trials on each dataset, and report the mean prediction rate and the standard deviation per class for evaluation. For training, we adopt the one-vs-all multi-class scheme. Each classifier is learned with LIBSVM [68], where the regularization parameter is fixed to 10. Moreover, we consider a linear kernel for each feature and estimate the optimal weights with the

MKL algorithm introduced in Section IV-C. The classifiers are then trained on the combined linear kernel. All the experiments are conducted on several servers with 24 GB of memory and 4-core 2.83 GHz Intel CPUs.

4) Computation Cost: The computation cost of the proposed pipeline is dominated by the following three key steps: image decomposition, feature extraction and kernel combination. Firstly, with the efficient adaptive PDHG method, the image decomposition takes ∼2.44 seconds on a 300 × 300 image with a single server. As can be observed in Fig. 12, the adaptive PDHG method is more practical than the non-adaptive scheme, as it reaches the convergence requirement faster. Secondly, the running time of the feature extraction heavily depends on the settings of the architecture, such as the type of feature, the size of the dictionary, etc. For convenience, we list the running time under different settings for a typical 300 × 300 image with a single server in Table I. Finally, the computational cost of the kernel combination can be evaluated for the training and testing phases respectively. For training, we learn the optimal combination weights by solving the SILP problem in Eq. 16. The time complexity of the SILP formulation is dominated by solving the linear system, which has a complexity of O(n³), where n is the number of training samples. In practice, choosing a small set of samples can speed up training without losing much accuracy. As for testing, we only need to average the kernels with the learned weights; thus, the computational cost during testing scales linearly with the number of kernels.

C. Experimental Results for 15-Scenes

We report the accuracy of each architecture over various numbers of training samples in Fig. 13. We gradually increase the size of the training set from 20 to 100 images per category with a step of 20, and use the rest as the test set. As we can see from Fig. 13(a), Arch. 2 (multi-scale) always outperforms Arch. 1 (mono-scale), and the maximum improvement is 1.26%. However, our method (Arch. 4) achieves an even greater improvement in scene classification, outperforming Arch. 1 by 3.27% and Arch. 2 by 2.20% by adding decomposition to the feature extraction. Interestingly, we find that our method equipped with LBP+SIFT features achieves a large improvement on the 15-Scenes dataset. As shown in Fig. 13(b), compared with Arch. 1 and Arch. 3, our method (Arch. 5) improves the accuracy by 8.87% and 3.32%



Fig. 14. Accuracy comparison with different baseline architectures on Caltech-101. (a) and (b) are the comparisons with the hand-crafted feature extraction scheme. (c) is the comparison with the learned feature extraction scheme.

TABLE III: CLASSIFICATION ACCURACY COMPARISON ON CALTECH-101

Fig. 15. Accuracy comparison with different baseline architectures on Caltech-256. (a) and (b) are the comparisons with the hand-crafted feature extraction scheme, under 15 and 30 training samples respectively. (c) and (d) are the comparisons with the learned feature extraction scheme, under 15 and 30 training samples respectively.

respectively. The possible reason is that a scene contains more texture information; by adding the property specific decomposition and well designed texture descriptors, each image can be represented more accurately. Besides, the accuracy with the learned features is shown in Fig. 13(c). We observe that the features learned from raw pixels with a two-stage network are competitive with the highly specialized SIFT and LBP features. Our decomposition and extraction framework (Arch. 5) also consistently improves the performance of the baseline frameworks (Arch. 1 and Arch. 3).

We also compare our method with recently reported high-performing coding-pooling methods in Table II, e.g. single-stage sparse coding [4], [5], a biologically inspired method [10], Fisher coding [69], [70], and sophisticated coding-pooling methods [2], [19], [38], [66], [71], [72]. From the results we can see that most methods improve the performance by refining the coding or pooling schemes. However, our method only extends the basic sparse coding pipeline with a decomposition preprocess, and still achieves improved performance. Our highest score is 88.07% with 100 training

TABLE IV: CLASSIFICATION ACCURACY COMPARISON ON CALTECH-256

samples, which almost achieves the state-of-the-art performance, only slightly lower than the highly designed coding methods [19], [66].



TABLE V: CLASSIFICATION RESULTS (AP IN %) COMPARISON ON PASCAL VOC 2007

D. Experimental Results for Caltech-101

In this experiment, we evaluate our method on the Caltech-101 object recognition dataset. To stay consistent with the previous work [39], we set the number of training images to 5, 10, 15, 20, 25, and 30 per category and take the rest as the test set. Performance comparisons with the baseline methods are shown in Fig. 14. One can see that our method consistently yields better results than the baseline methods. The comparison results demonstrate the effectiveness of the proposed decomposition and extraction scheme for object recognition.

In Table III, comparisons with the related methods are provided. We summarize our findings as follows: Firstly, compared with the most related work [39], our method yields a total improvement of 8.98% over the result in [39] with 30 training samples. Although our method and [39] both belong to the "recognition-through-decomposition-and-fusion" category, the results show that our method has a better ability to decompose and fuse different visual cues. Secondly, our method achieves slightly better performance than the MKL methods [15], [33], which combine 39 kernels in total. In [15] and [33], kernels are computed with diverse types of image features and different levels of a pyramid on the original image. However, our method extracts property specific features from six scales of decomposed images, which results in only 12 kernels in Arch. 4 and 18 kernels in Arch. 5. Thus the results show that our decomposition and extraction schemes have the ability to generate more complementary and discriminative features for object classification. Thirdly, our method reaches near state-of-the-art performance [19], [38], [72], and the highest accuracy is 78.96% with a dictionary size of only 1024. One should note that [19], [38], and [72] use a more complex coding-pooling pipeline, while we only extend the basic sparse coding pipeline [5] with the decomposition process and improve the accuracy of [5] by almost 6% on this dataset.

E. Experimental Results for Caltech-256

In this subsection we report the performance of our method on the Caltech-256 dataset in Fig. 15 and Table IV. As is standard practice [15], we evaluate our method under two different settings, selecting 15 and 30 images per category as training data respectively, and 25 images per category for testing.

Fig. 15 shows the comparison with the baseline architectures. It can be seen that by learning hybrid mid-level features from the decomposed images, the classification accuracies are significantly improved. For example, in Fig. 15(a), when the features are obtained by a single-stage network on top of hand-crafted features, our proposed method (Arch. 5) improves the baseline method (Arch. 3) by 4%, and Arch. 4 improves the baseline method (Arch. 2) by 4.5% with 15 training samples. The improvements are consistent with 30 training samples, as shown in Fig. 15(b). Meanwhile, we observe that our method with a two-stage feature learning scheme also greatly improves the baseline methods, by 5.9% (vs. Arch. 1) and 4.1% (vs. Arch. 3) respectively, with 15 training samples in Fig. 15(c). It is interesting that the hybrid features learned from raw pixels with a two-stage network are more powerful than the features learned by a single-stage sparse coding scheme on the Caltech-256 dataset. The highest accuracy (38.74 ± 0.38% for 15 training samples and 46.84 ± 0.19% for 30 training samples) is obtained by Arch. 5 with learned features. This can be explained by the scalability of the learned features and the saturation of the hand-crafted features [73] when faced with large scale datasets.

In Table IV, comparisons with related and state-of-the-art methods are provided. As can be observed in the table, our method (Arch. 5) outperforms the related works [5], [6], and [15] and the recent works [10], [18], [66], [67], and [72]. Due to the low complexity of the coding scheme, our method does not outperform [70]. However, we show in the next subsection that our framework can also be effectively combined with a more sophisticated coding approach to achieve comparable performance.

F. Combination With a Sophisticated Coding Scheme

In this subsection, we demonstrate that a sophisticated mid-level feature extraction scheme can also be integrated into our framework to further improve the performance. Specifically, we combine our framework with the Improved Fisher Kernel (IFK) [67], which is a state-of-the-art coding method and achieves the best performance in [11]. To make a fair comparison, we directly use the IFK codes provided by the VLFeat open source library [74]. In the original implementation of IFK from VLFeat, 80-dimensional PCA-SIFT features are densely extracted from images on a grid with a step of 4 pixels, the size of the Gaussian mixture model is set to 256, and spatial pyramid matching with levels of [1 × 1, 3 × 1] is employed. To integrate IFK into our framework, we extract PCA-SIFT features at different scales from the structure and texture images

Compared with the original IFK, the only difference is therefore that we construct the Fisher vectors with property specific low-level features from the decomposed images instead of the original image. In Table IV, we show the corresponding classification accuracies on Caltech-256, using the prefix "Decomposition +" to denote the integration of IFK into our framework. It can be concluded from the table that: 1) IFK is significantly improved when integrated into our framework (from 36.36% to 41.11% for 15 training samples and from 44.14% to 49.19% for 30 training samples); and 2) by combining the proposed decomposition approach with a more sophisticated coding approach such as IFK, the Decomposition + IFK method outperforms the state-of-the-art result [70] on Caltech-256.

Furthermore, we evaluate the effectiveness of the integrated method (i.e., Decomposition + IFK) on the PASCAL VOC 2007 dataset with the same setting as above. The performance is evaluated by the standard PASCAL protocol, which computes the average precision (AP) from the precision/recall curve. The detailed evaluation results (the mean AP (mAP) across the 20 categories) are shown in Table V. It can be observed that Decomposition + IFK outperforms IFK on 15 classes and improves the mAP from 61.7 to 62.5. The proposed method is also competitive with the winner of PASCAL VOC 2007 and the recently developed methods [11], [70]. This again demonstrates the effectiveness of the proposed framework when it is integrated with a sophisticated coding method.

VI. CONCLUSION

In this paper, we presented a novel framework for visual classification based on hierarchical image decomposition and hybrid mid-level feature extraction. In our method, we highlight the role of property specific decomposition in feature extraction and explore two efficient schemes to produce property related feature representations based on the decomposed images. Extensive experimental results have demonstrated the effectiveness of the new method. We believe that this work provides a new and exciting research direction for visual classification. Moreover, many other visual tasks may also benefit from property specific decomposition, e.g., detection, tracking, and segmentation.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their valuable and constructive comments on improving the paper.

REFERENCES

[1] D. Hubel and T. Wiesel, “Receptive fields of single neurones in the cat’s striate cortex,” J. Physiol., vol. 148, no. 3, pp. 574–591, 1959.
[2] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce, “Learning mid-level features for recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 2559–2566.
[3] F. F. Li and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2005, pp. 524–531.

[4] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2006, pp. 2169–2178.
[5] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 1794–1801.
[6] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 3360–3367.
[7] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proc. 26th Annu. Int. Conf. Mach. Learn. (ICML), Jun. 2009, pp. 609–616.
[8] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun, “Learning convolutional feature hierarchies for visual recognition,” in Proc. Adv. Neural Inform. Process. Syst., Dec. 2010, pp. 1090–1098.
[9] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, “Robust object recognition with cortex-like mechanisms,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 3, pp. 411–426, Mar. 2007.
[10] C. Theriault, N. Thome, and M. Cord, “Extended coding and pooling in the HMAX model,” IEEE Trans. Image Process., vol. 22, no. 2, pp. 764–777, Feb. 2013.
[11] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, “The devil is in the details: An evaluation of recent feature encoding methods,” in Proc. Brit. Mach. Vis. Conf. (BMVC), Aug. 2011, pp. 76.1–76.12.
[12] F.-F. Li, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” Comput. Vis. Image Understand., vol. 106, no. 1, pp. 59–70, Apr. 2007.
[13] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” California Inst. Technology, Pasadena, CA, USA, Tech. Rep. CNS-TR-2007-001, 2007.
[14] M. Everingham, L. van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. (2007). The PASCAL Visual Object Classes Challenge [Online]. Available: http://www.pascal-network.org/challenges/VOC/
[15] P. Gehler and S. Nowozin, “On feature combination for multiclass object classification,” in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep. 2009, pp. 221–228.
[16] M. Varma and D. Ray, “Learning the discriminative power-invariance trade-off,” in Proc. IEEE 11th Int. Conf. Comput. Vis. (ICCV), Oct. 2007, pp. 1–8.
[17] K. Yu, Y. Lin, and J. Lafferty, “Learning image representations from the pixel level via hierarchical sparse coding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 1713–1720.
[18] S. Gao, I. W. H. Tsang, and L.-T. Chia, “Laplacian sparse coding, hypergraph Laplacian sparse coding, and applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 92–104, Jan. 2013.
[19] K. Balasubramanian, K. Yu, and G. Lebanon, “Smooth sparse coding via marginal regression for learning sparse representations,” in Proc. Int. Conf. Mach. Learn., Jun. 2013, pp. 289–297.
[20] Y. Jia, C. Huang, and T. Darrell, “Beyond spatial pyramids: Receptive field learning for pooled image features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 3370–3377.
[21] M. Behrmann and R. Kimchi, “What does visual agnosia tell us about perceptual organization and its relationship to object perception,” J. Experim. Psychol., Human Perception Perform., vol. 29, no. 1, pp. 19–42, Feb. 2003.
[22] B. Lorella, C. Clara, and S. Guiseppe, “Dissociation between contour-based and texture-based shape perception: A single case study,” Vis. Cognit., vol. 4, no. 3, pp. 275–310, 1997.
[23] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.
[24] S.-C. Zhu, K. Shi, and Z. Si, “Learning explicit and implicit visual manifolds by information projection,” Pattern Recognit. Lett., vol. 31, no. 8, pp. 667–685, Jun. 2010.
[25] J. Ye, S. Ji, and J. Chen, “Multi-class discriminant kernel learning via convex programming,” J. Mach. Learn. Res., vol. 9, pp. 719–758, Jan. 2008.
[26] T. Ahonen, A. Hadid, and M. Pietikainen, “Face description with local binary patterns: Application to face recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 12, pp. 2037–2041, Dec. 2006.
[27] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2005, pp. 886–893.

[28] A. J. Ma, P. C. Yuen, and J.-H. Lai, “Linear dependency modeling for classifier fusion and feature combination,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 5, pp. 1135–1148, 2013.
[29] X.-T. Yuan and S. Yan, “Visual classification with multi-task joint sparse representation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 3493–3500.
[30] P. Viola and M. J. Jones, “Robust real-time face detection,” Int. J. Comput. Vis., vol. 57, no. 2, pp. 137–154, May 2004.
[31] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan, “Learning the kernel matrix with semidefinite programming,” J. Mach. Learn. Res., vol. 5, pp. 27–72, Jan. 2004.
[32] F. Orabona, L. Jie, and B. Caputo, “Multi kernel learning with online-batch optimization,” J. Mach. Learn. Res., vol. 13, no. 1, pp. 227–253, Jan. 2012.
[33] A. Kumar, A. Niculescu-Mizil, K. Kavukcuoglu, and H. Daume, “A binary classification framework for two stage multiple kernel learning,” in Proc. Int. Conf. Mach. Learn., Jul. 2012, pp. 1295–1302.
[34] C.-E. Guo, S.-C. Zhu, and Y. N. Wu, “Primal sketch: Integrating structure and texture,” Comput. Vis. Image Understand., vol. 106, no. 1, pp. 5–19, Apr. 2007.
[35] D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. New York, NY, USA: Henry Holt, 1982.
[36] Z. Si and S.-C. Zhu, “Learning hybrid image templates by information projection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1354–1367, Jul. 2012.
[37] J. Shotton, A. Blake, and R. Cipolla, “Efficiently combining contour and texture cues for object recognition,” in Proc. Brit. Mach. Vis. Conf. (BMVC), Sep. 2008, pp. 7.1–7.10.
[38] M. Law, N. Thome, and M. Cord, “Hybrid pooling fusion in the BoW pipeline,” in Proc. 12th Int. Eur. Conf. Comput. Vis., Oct. 2012, pp. 355–364.
[39] X. Ma and E. Grimson, “Learning coupled conditional random field for image decomposition: Theory and application in object categorization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2008, pp. 1–8.
[40] L. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Phys. D, Nonlinear Phenomena, vol. 60, nos. 2–4, pp. 259–268, Nov. 1992.
[41] Y. Meyer, Oscillating Patterns in Image Processing and Nonlinear Evolution Equations: The 15th Dean Jacqueline B. Lewis Memorial Lectures. Providence, RI, USA: American Mathematical Society, 2001.
[42] L. Vese and S. Osher, “Modeling textures with total variation minimization and oscillating patterns in image processing,” Dept. Math., Univ. California, Los Angeles, CA, USA, Tech. Rep. 02-19, 2002.
[43] S. Osher, A. Sole, and L. Vese, “Image decomposition and restoration using total variation minimization and the H^{-1} norm,” Dept. Math., Univ. California, Los Angeles, CA, USA, Tech. Rep. 02-57, 2002.
[44] H. Schaeffer and S. Osher, “A low patch-rank interpretation of texture,” SIAM J. Imag. Sci., vol. 6, no. 1, pp. 226–262, Feb. 2013.
[45] S. Alliney, “Digital filters as absolute norm regularizers,” IEEE Trans. Signal Process., vol. 40, no. 6, pp. 1548–1562, Jun. 1992.
[46] T. Chan and S. Esedoglu, “Aspects of total variation regularized L1 function approximation,” SIAM J. Appl. Math., vol. 65, no. 5, pp. 1817–1837, 2005.
[47] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in Proc. 6th Int. Conf. Comput. Vis., Jan. 1998, pp. 839–846.
[48] K. He, J. Sun, and X. Tang, “Guided image filtering,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 6, pp. 1397–1409, 2013.
[49] P. Perona and J. Malik, “Scale-space and edge detection using anisotropic diffusion,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 7, pp. 629–639, Jul. 1990.
[50] D. M. Strong, J.-F. Aujol, and T. F. Chan, “Scale recognition, regularization parameter selection, and Meyer’s G norm in total variation regularization,” Multiscale Model. Simul., vol. 5, no. 1, pp. 273–303, Jul. 2006.
[51] W. Yin, D. Goldfarb, and S. Osher, “The total variation regularized L1 model for multiscale decomposition,” Multiscale Model. Simul., vol. 6, no. 1, pp. 190–211, Apr. 2007.
[52] T. Chen, W. Yin, X. Zhou, D. Comaniciu, and T. Huang, “Total variation models for variable lighting face recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 9, pp. 1519–1524, Sep. 2006.
[53] Z. Farbman, R. Fattal, D. Lischinski, and R. Szeliski, “Edge-preserving decompositions for multi-scale tone and detail manipulation,” ACM Trans. Graph., vol. 27, no. 3, pp. 67:1–67:10, 2008.

[54] E. Tadmor, S. Nezzar, and L. Vese, “A multiscale image representation using hierarchical (BV, L^2) decompositions,” Multiscale Model. Simul., vol. 2, no. 4, pp. 554–579, Jul. 2006.
[55] A. Chambolle and T. Pock, “A first-order primal-dual algorithm for convex problems with applications to imaging,” J. Math. Imag. Vis., vol. 40, no. 1, pp. 120–145, May 2011.
[56] T. Goldstein, E. Esser, and R. Baraniuk, “Adaptive primal-dual optimization for image processing and learning,” in Proc. 6th NIPS Workshop Optim. Mach. Learn., Dec. 2013.
[57] M. Pesaresi and J. A. Benediktsson, “A new approach for the morphological segmentation of high-resolution satellite imagery,” IEEE Trans. Geosci. Remote Sens., vol. 39, no. 2, pp. 309–320, Feb. 2001.
[58] M. Pesaresi, G. K. Ouzounis, and P. Soille, “Differential area profiles: Decomposition properties and efficient computation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 8, pp. 1533–1548, Aug. 2012.
[59] P. Soille, “Constrained connectivity for hierarchical image partitioning and simplification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 7, pp. 1132–1145, Jul. 2008.
[60] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.
[61] C. Liu, J. Yuen, and A. Torralba, “SIFT flow: Dense correspondence across scenes and its applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 978–994, 2011.
[62] L. Bo, X. Ren, and D. Fox, “Hierarchical matching pursuit for image classification: Architecture and fast algorithms,” in Proc. Adv. Neural Inform. Process. Syst., Dec. 2011, pp. 2115–2123.
[63] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[64] L. Bo, X. Ren, and D. Fox, “Multipath sparse coding using hierarchical matching pursuit,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 660–667.
[65] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, “Large scale multiple kernel learning,” J. Mach. Learn. Res., vol. 7, pp. 1531–1565, Jan. 2006.
[66] S. Yan, X. Xu, D. Xu, S. Lin, and X. Li, “Beyond spatial pyramids: A new feature extraction framework with dense spatial sampling for image classification,” in Proc. Eur. Conf. Comput. Vis., Oct. 2012, pp. 473–487.
[67] F. Perronnin, J. Sánchez, and T. Mensink, “Improving the Fisher kernel for large-scale image classification,” in Proc. Eur. Conf. Comput. Vis., Oct. 2010, pp. 143–156.
[68] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1–27:27, Apr. 2011.
[69] X. Zhou, K. Yu, T. Zhang, and T. Huang, “Image classification using super-vector coding of local image descriptors,” in Proc. Eur. Conf. Comput. Vis., Sep. 2010, pp. 141–154.
[70] T. Kobayashi, “BOF meets HOG: Feature extraction based on histograms of oriented p.d.f. gradients for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 747–754.
[71] F. Sadeghi and M. F. Tappen, “Latent pyramidal regions for recognizing scenes,” in Proc. Eur. Conf. Comput. Vis., Oct. 2012, pp. 228–241.
[72] J. Feng, B. Ni, Q. Tian, and S. Yan, “Geometric p-norm feature pooling for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 2697–2704.
[73] X. Zhu, C. Vondrick, D. Ramanan, and C. Fowlkes, “Do we need more training data or better models for object detection?,” in Proc. Brit. Mach. Vis. Conf., Sep. 2012, pp. 1–11.
[74] A. Vedaldi and B. Fulkerson. (2008). VLFeat: An Open and Portable Library of Computer Vision Algorithms [Online]. Available: http://www.vlfeat.org/

Yuqiang Fang received the master’s degree in control science and engineering from the National University of Defense Technology, Changsha, China, in 2010, where he is currently pursuing the Ph.D. degree in pattern recognition and intelligent systems. He was with the Learning and Vision Research Group, National University of Singapore, Singapore, as a Research Assistant in 2013. His research interests lie in machine learning and computer vision, and their applications to autonomous vehicles.

Qiang Chen is currently a Research Scientist with IBM Research, Melbourne, VIC, Australia. He was a Research Fellow with the Department of Electrical and Computer Engineering, National University of Singapore (NUS), Singapore. He received the B.E. degree from the Department of Automation, University of Science and Technology of China, Hefei, China, in 2006, the M.S. degree from the Department of Automation, Shanghai Jiao Tong University, Shanghai, China, in 2009, and the Ph.D. degree from the Department of Electrical and Computer Engineering, NUS, in 2013. His research interests include computer vision and pattern recognition. He was a recipient of the Best Student Paper Award at PREMIA’12, the winner prizes of the classification task in both PASCAL VOC’10 and PASCAL VOC’11, and the honorable mention prize of the detection task in PASCAL VOC’10.

Lin Sun received the B.S. degree in electronic and information engineering from the Harbin Institute of Technology, Harbin, China, in 2010, and the M.Phil. degree in electronic and computer engineering from the Hong Kong University of Science and Technology, Hong Kong, in 2012, where he is currently pursuing the Ph.D. degree. After receiving the M.Phil. degree, he was with the National University of Singapore, Singapore, as a Research Engineer, and with Lenovo Group Ltd., Hong Kong, as a Researcher. His research interests include video compression, deep learning, and computer vision.

Bin Dai received the Ph.D. degree in control science and engineering from the National University of Defense Technology (NUDT), Changsha, China, in 1998. He was a Visiting Scholar with the Intelligent Process Control and Robotics Laboratory, Karlsruhe Institute of Technology, in 2006. He is currently a Professor with the College of Mechatronic Engineering and Automation, NUDT. His research interests include pattern recognition, computer vision, autonomous vehicles, and intelligent systems.

Shuicheng Yan is currently an Associate Professor with the Department of Electrical and Computer Engineering, National University of Singapore (NUS), Singapore, and the Founding Lead of the Learning and Vision Research Group. His research areas include machine learning, computer vision, and multimedia. He has authored and co-authored hundreds of technical papers over a wide range of research topics, with more than 12,000 Google Scholar citations and an H-index of 49. He has been serving as an Associate Editor of the IEEE Transactions on Knowledge and Data Engineering, the IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), and ACM Transactions on Intelligent Systems and Technology. He was a recipient of the Best Paper Awards from ACM MM’13 (Best Paper and Best Student Paper), ACM MM’12 (Best Demo), PCM’11, ACM MM’10, ICME’10, and ICIMCS’09, the runner-up prize of ILSVRC’13, the winner prizes of the classification task in PASCAL VOC from 2010 to 2012, the winner prize of the segmentation task in PASCAL VOC in 2012, the honorable mention prize of the detection task in PASCAL VOC in 2010, the TCSVT Best Associate Editor Award in 2010, the Young Faculty Research Award in 2010, the Singapore Young Scientist Award in 2011, and the NUS Young Researcher Award in 2012.
