
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 8, AUGUST 2015

Modeling Neuron Selectivity Over Simple Midlevel Features for Image Classification

Shu Kong, Zhuolin Jiang, Member, IEEE, and Qiang Yang, Fellow, IEEE

Abstract— We now know that good mid-level features can greatly enhance the performance of image classification, but how to efficiently learn the image features is still an open question. In this paper, we present an efficient unsupervised mid-level feature learning approach (MidFea), which only involves simple operations such as k-means clustering, convolution, pooling, vector quantization, and random projection. We show that these simple features can also achieve good performance in traditional classification tasks. To further boost the performance, we model the neuron selectivity (NS) principle by building an additional layer over the mid-level features prior to the classifier. The NS-layer learns category-specific neurons in a supervised manner with both bottom-up inference and top-down analysis, and thus supports fast inference for a query image. Through extensive experiments, we demonstrate that this higher-level NS-layer notably improves the classification accuracy with our simple MidFea, achieving comparable performance for face recognition, gender classification, age estimation, and object categorization. In particular, our approach runs faster in inference by an order of magnitude than sparse coding-based feature learning methods. As a conclusion, we argue that not only do carefully learned features (MidFea) bring improved performance, but a sophisticated mechanism (NS-layer) at a higher level also boosts the performance further.

Index Terms— Mid-level feature, neuron selectivity, structural sparse coding, feature learning, image classification.

I. INTRODUCTION

IMAGE classification performance relies on the quality of image features [36]. Low-level features include the common hand-crafted ones such as SIFT [29] and HOG [11], and those learned from the building blocks of an unsupervised model, such as Convolutional Deep Belief Networks (CDBN) [28] and Deconvolutional Networks (DN) [42]. Mid-level features can then be generated over these low-level ones to improve the performance through further operations, such as sparse coding and pooling [3], [21].

Manuscript received June 4, 2014; revised October 12, 2014 and January 22, 2015; accepted March 22, 2015. Date of publication March 26, 2015; date of current version April 29, 2015. The work of Q. Yang was supported by the China National Fundamental Research Grant (973 Program) under Project 2014CB340304. The associate editor coordinating the review of this manuscript and approving it for publication was Mr. Pierre-Marc Jodoin. S. Kong was with the Noah's Ark Laboratory and The Hong Kong University of Science and Technology, Hong Kong. He is now with the University of California, Irvine, CA 92697 USA (e-mail: [email protected]). Z. Jiang is with the Noah's Ark Laboratory, Hong Kong (e-mail: [email protected]). Q. Yang is with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2015.2417502

To learn mid-level features, current methods usually adopt a hierarchical architecture [42], in which each layer accumulates information from the layer beneath to form more complex features. Despite their similarity, these methods mainly differ in the design of the nonlinearity, which is crucial for good classification performance. Spatial pyramid matching (SPM) based methods [3], [37], [40] apply sparse coding and max-pooling for the nonlinearity. DN relies on sparse coding, pooling and unpooling [42]. CDBN uses sparse coding and quasi max-pooling [28]. Predictive Sparse Decomposition (PSD) further introduces nonlinear absolute-value rectification and local contrast normalization [21]. Inspired by these methods, we present a simple mid-level feature learning approach (MidFea) that merely consists of k-means, convolution, max-pooling, vector quantization (VQ) and random projection, as shown in Fig. 1. The nonlinearity in our approach comes from the convolution with thresholding, the max-pooling and the VQ. Although these operations are also adopted by existing methods [40], [42], ours is a purely feed-forward pipeline that runs faster and, unlike PSD and CDBN, does not introduce heavily parameterized functions. Through a comparison with SIFT and HMAX [33] in Section III-B, we explain why our MidFea learns desirable features.

According to studies in neuroscience [1], neurons tend to respond selectively to images from specific categories. Hence, we build an additional Neuron-Selectivity (NS) layer over the mid-level features, as demonstrated in Fig. 1, so that neurons fire selectively and semantically for images from specific categories. We formulate the NS-layer as a structured sparse learning problem, which supports both bottom-up inference and top-down analysis (modeling the reconstruction of the image as an explicit function of the resultant codes). The NS-layer resembles a combination of discriminative dictionary learning [22], [24] and an autoencoder [19], but it operates at a higher level over our MidFea with sophisticated constraints and can be applied on top of other mid-level features, whereas existing methods operate on pixel-level images to learn low-level and mid-level features. Through extensive experiments, we show the performance improvements brought by the simple MidFea and the NS-layer, and argue that not only does a carefully learned feature (MidFea) bring improved accuracy, but a sophisticated mechanism (NS-layer) at a higher level also boosts the performance further.

In summary, our contributions are two-fold. (1) We propose a simple and efficient method to learn mid-level features, and explain why the proposed approach generates




Fig. 1. Left panel: the flowchart of the proposed framework. Specifically, the proposed MidFea learns mid-level features in a feed-forward manner, then the Neuron Selectivity (NS) layer transforms the features into high-level semantic representations which are fed into the linear classifier. Right panel: demonstration of the local feature descriptor with a 2 × 2 window. With the help of 3D max-pooling, our local descriptor captures the salient orientations within a cuboid.

desirable features. (2) We model the neuron selectivity principle over the mid-level features through the NS-layer to boost the performance. The NS-layer is a general layer that supports both top-down analysis and bottom-up inference, which is an appealing property in real-world applications. Through extensive experiments, we demonstrate that our framework not only achieves comparable or even state-of-the-art results on public databases, but also runs faster by an order of magnitude than related sparse coding based methods.

We begin by describing our mid-level feature learning approach in Section III, followed by the proposed NS-layer in Section IV. We then present the experimental validation in Section V and conclude in Section VI.

II. BACKGROUND

The concept of mid-level features was first introduced in [3], meaning features built over low-level ones that remain close to image-level information without any need for a high-level structured image description. Traditionally, mid-level features [3], [37], [40] are learned via sparse coding techniques over low-level hand-crafted descriptors (SIFT [29] and HOG [11]). However, despite the improved accuracy, extracting the low-level descriptors lacks flexibility and requires significant amounts of domain knowledge, human labor, and heavy computation. As a result, researchers have searched for alternatives that adaptively learn the features so that the system is both efficient and effective. Some impressive unsupervised feature learning methods have been developed, such as CDBN [28], DN [42] and autoencoders [19]. Despite their differences, empirical validation confirms several aspects [8]. First, nonlinearity plays a central role in mid-level feature learning for improved performance [21]. Second, simpler algorithms can outperform complicated feature learning methods when operational factors, such as more densely extracted local descriptors, are taken seriously [8]. Third, the low-level features learned by these methods consistently resemble Gabor filters, and even the simplest k-means can produce similar extractors [8]. These studies motivate us to present a simple and effective approach to learn features from low level to mid level. Our MidFea learns the features in a bottom-up and

unsupervised fashion. It consists of soft convolution (sConv), 3D max-pooling, a local feature descriptor, and mid-level feature generation, as shown in Fig. 1. The sConv operates in a bottom-up manner with nonlinear thresholding; this is different from Convolutional Neural Networks, which involve a sigmoid function [25], and adaptive DN, which uses convolution in a time-consuming top-down manner [42]. Moreover, 3D max-pooling is also adopted in adaptive DN, which calculates the sparse feature maps through convolutional sparse coding [42]. This means the maps contain negative values that are hard to interpret, hence adaptive DN only considers the absolute values. In contrast, our method produces non-negative elements for all feature maps, which are more interpretable w.r.t. capturing the statistics of orientations.

In the literature, mid-level features are usually fed into a linear or nonlinear classifier for classification [27], [40], [42]. Among these methods, Jiang et al. propose a label-consistent dictionary learning method (LC-KSVD) to jointly learn a discriminative dictionary and a linear regressive classifier [22]. As the learned dictionary can encode features which may not be linearly separable, a simple jointly learned linear classifier suffices to achieve impressive results over the sparse codes of the features. However, as LC-KSVD performs the sparse coding in a top-down analytical manner, it requires much time to classify a query image. Additionally, some methods learn to predict sparse codes in a bottom-up manner from the given signal with a nonlinear function [21], [23]. These methods are currently used to produce sparse features over image patches at a low level. Inspired by these works, we propose a neuron selectivity (NS) layer, which jointly considers the top-down influence of the classification task and bottom-up inference; it is thus equipped with the power of discriminative dictionary learning while supporting faster classification of query images. Note that our method also learns to approximate the sparse codes in a bottom-up manner, but it works at a higher level, because the input consists of mid-level features rather than image patches.

III. MID-LEVEL FEATURE LEARNING

As discussed previously, our MidFea consists of soft convolution (sConv), 3D max-pooling, a local feature descriptor, and mid-level feature generation. In this section, we elaborate



Fig. 2. Demonstration of soft convolution: (a) nine low-level filters learned by k-means over the face dataset [15]; (b) three images of the same person under different illumination conditions; (c) three convolutional feature maps of each image displayed in each row with three different filters; (d) normalized maps over (c); (e) thresholded maps over (d); and (f) normalized maps over (e).

these steps and discuss the advantages of each.

A. The Proposed MidFea

Soft Convolution: Our MidFea first runs k-means clustering over image patches of the training set to generate the low-level feature extractors. Once the filters are obtained, we simply convolve the image with them to get the feature maps, which can be seen as a 3rd-order tensor in the left panel of Fig. 1. This is different from existing methods that embed convolution into a nonlinear sigmoid function [28] or an analytical sparse decomposition [21], [42]. It is also worth noting that, different from simple convolution, ours is a soft convolution (sConv) that adaptively generates sparse feature maps. sConv consists of several steps, as demonstrated in Fig. 2 (full maps are in the supplementary material): 1) convolving the image with the low-level filters to generate a set of feature maps; 2) normalizing all maps along the third mode into a comparable range, say [0, 1]; 3) thresholding the maps element-wise with their mean map¹; 4) normalizing the resulting maps along the third mode again for the sake of subsequent operations. sConv has several advantages. First, its convolutional behavior amounts to exhaustively processing all possible patches in the image [8], i.e. extracting local descriptors in the densest way. Second, normalization along the third mode preserves local contrast information by accumulating orientation statistics, making the resultant maps more resistant to illumination changes, as shown in Fig. 2. Third, the thresholding operation filters out trivial information and background noise, which can be seen visually in Fig. 6 through comparison with dense SIFT feature maps. It is also worth noting that sConv resembles SIFT in capturing local information. Specifically, SIFT produces responses to different orientations according to sharp edges in a local patch, but our sConv can do more than orientation detection. Fig. 6 (a) displays learned filters which are quite similar to SIFT in capturing orientations, but when the filters are learned over a face dataset, as shown in Fig. 2 (a), they actually capture more adaptive patterns of the face data, such as the curved boundary of the cheek. This further indicates that the flexibility and adaptability of sConv make it suitable for feature generation over a specific dataset.

¹We observe that using the mean maps for thresholding consistently produces decent results.
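To make the four sConv steps above concrete, the following NumPy sketch runs them on a single grayscale image. It is only an illustration under our own assumptions: we use 'same' convolution, a per-pixel min-max normalization across the filter mode for step 2, and an L1 renormalization for step 4; the text does not pin down these exact choices.

```python
import numpy as np
from scipy.signal import convolve2d

def soft_convolution(image, filters, eps=1e-8):
    """Sketch of soft convolution (sConv): convolve, normalize across the
    filter mode, threshold by the mean map, and normalize again (cf. Fig. 2)."""
    # 1) Convolve the image with every low-level filter -> H x W x K tensor.
    maps = np.stack([convolve2d(image, f, mode='same') for f in filters], axis=2)
    # 2) Normalize along the third mode into a comparable range, here [0, 1].
    maps = maps - maps.min(axis=2, keepdims=True)
    maps = maps / (maps.max(axis=2, keepdims=True) + eps)
    # 3) Threshold element-wise with the mean map: keep only above-average responses.
    mean_map = maps.mean(axis=2, keepdims=True)
    maps = np.where(maps > mean_map, maps, 0.0)
    # 4) Normalize along the third mode again for the subsequent operations.
    maps = maps / (maps.sum(axis=2, keepdims=True) + eps)
    return maps  # non-negative, sparse feature maps

# Toy usage: nine 7x7 filters (in the paper they come from k-means on patches).
filters = [np.random.randn(7, 7) for _ in range(9)]
feature_maps = soft_convolution(np.random.rand(66, 48), filters)
```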

3D Max-Pooling: We adopt the 3D max-pooling operation [42] to obtain further robustness over the sConv feature maps. Suppose we have 9 filters that generate 9 feature maps; 3D max-pooling then operates in a cuboid of size 2 × 2 × 2, meaning a non-overlapping 2 × 2 neighborhood taken over every pair of feature maps at the same location. The pooling yields a single value, the maximum within the volume. From a macro perspective, we obtain 36 new maps, each of which has half the size of the previous ones. This simple operation not only reduces the size of the feature maps, but also captures the most salient information in a 3D neighborhood by eliminating trivial orientations. It is worth noting that DN [42] also uses 3D max-pooling for nonlinearity. However, DN is a top-down analysis method which requires more time to derive the feature maps by sparse coding; ours is feed-forward and thus runs faster. Please note that our 3D max-pooling is performed on the local features within a neighborhood patch to achieve local invariance. This is different from ScSPM [40] and LLC [37], in which max-pooling is only performed on the sparse codes to achieve local robustness. Additionally, we also use max-pooling over the codes in mid-level feature generation.

Local Feature Descriptor: The low-level local descriptor is now constructed over the resulting 36 sConv feature maps, as demonstrated in the right panel of Fig. 1. By sliding a 2 × 2 window over each feature map, we produce 4 times more maps.² Hence, from the 36 feature maps, we generate 144 new ones. To get a better sense of these feature maps, recall the SIFT feature maps in sparse coding based SPM (ScSPM) [40]. If we densely extract SIFT descriptors for patches centered at every pixel, we generate 128 feature maps, each of which has the same size as the image. Our local feature construction process in a small neighborhood amounts to jointly encoding local features by vector quantization (VQ), thus bringing local invariance. Please note that, in ScSPM [40], SIFT descriptors concatenate 16 bins into a 128-D vector to represent a local patch centered at a pixel, and it is also shown in [3] that concatenating more SIFT bins improves the performance. In practice, our results are consistent with this observation: when we concatenate more bins of our low-level feature, we get a better result.

²Hereafter, we ignore the boundary effect for presentational convenience.
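The sketch below illustrates our reading of the two operations just described: the 2 × 2 × 2 pooling taken over every pair of sConv maps (giving C(9, 2) = 36 half-resolution maps) and the 2 × 2 sliding window that expands them into 144 maps. The exact pairing scheme and boundary handling are our assumptions for this sketch.

```python
import numpy as np
from itertools import combinations

def max_pool_3d(maps):
    """3D max-pooling sketch: for every pair of sConv maps, take the maximum
    over non-overlapping 2x2x2 cuboids, one half-resolution map per pair."""
    H, W, K = maps.shape
    H2, W2 = H // 2, W // 2
    pooled = []
    for i, j in combinations(range(K), 2):
        pair = maps[:H2 * 2, :W2 * 2, [i, j]]           # crop to even size, pick 2 maps
        blocks = pair.reshape(H2, 2, W2, 2, 2)           # 2x2 spatial blocks x 2 maps
        pooled.append(blocks.max(axis=(1, 3, 4)))        # max within each cuboid
    return np.stack(pooled, axis=2)                      # H/2 x W/2 x C(K,2)

def local_descriptors(pooled):
    """Local descriptor sketch: a 2x2 sliding window over each pooled map yields
    4 shifted copies, so 36 maps become 144 and every location gets a 144-D vector."""
    H, W, _ = pooled.shape
    shifts = [pooled[dy:H - 1 + dy, dx:W - 1 + dx, :]    # the four window offsets
              for dy in (0, 1) for dx in (0, 1)]
    return np.concatenate(shifts, axis=2)                # (H-1) x (W-1) x 144

# Example: 9 sConv maps -> 36 pooled maps -> 144-D local descriptors per location.
descriptors = local_descriptors(max_pool_3d(np.random.rand(64, 48, 9)))
```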


Even if the window size used to concatenate features can be set as a parameter in practice for better performance, we use a 2 × 2 window throughout our experiments. Moreover, when we only concatenate 4 SIFT bins in ScSPM, the performance is very poor. This further shows that our low-level features work better for classification tasks.

Mid-Level Feature Generation: We encode the descriptors by vector quantization (VQ) over a codebook learned by k-means. Then, we use max-pooling on the VQ codes in predefined partitions of the image,³ and concatenate the pooled codes into a large vector as the image representation (a sketch of this stage is given at the end of this section). Essentially, we could also use sparse coding as done in ScSPM and LLC, but we use VQ for two reasons. First, sparse coding is very time-consuming, and speed is one focus of our work. Second, even though sparse coding has been shown to improve classification accuracy over VQ [4], we do not get a better result in our model. This is consistent with our previous work [22]. The reason may be that the following layer compensates for the accuracy loss; in [22], we also built one additional layer by dictionary learning. As the concatenated vectors usually have more than ten thousand dimensions,⁴ they can hamper the subsequent stage of training the neuron selectivity layer (described in the next section). For this reason, we use random projection [35] for dimensionality reduction, and normalize the reduced vectors to unit length. One example that benefits from random projection is the random face used in [22]. Please note that there are many choices for dimensionality reduction, but we empirically observe that the learning-free random projection does not incur much accuracy loss. This is validated in Section V-E.

B. Discussion

In contrast to hand-crafted low-level descriptors such as SIFT and HMAX, ours are learned adaptively within the data domain in an unsupervised manner. Despite this main difference, our model shares similarities with these hard-wired features. For example, SIFT captures eight fixed orientations over image patches, while ours not only does this, but also captures subtle and important information owing to its adaptivity and flexibility in the learning process. Moreover, our descriptor also resembles the HMAX feature, which is derived in a feed-forward pathway and incorporates convolution and max-pooling. But HMAX is built in a two-layer architecture and has no sparsity in the feature maps, while ours produces more complicated and more resilient features in a deeper architecture with soft convolution. Recall that adaptive feature learning methods also produce filters at the low level to capture local edge information. Therefore, both hand-crafted and adaptively learned features essentially aggregate statistical information in local patches as the low-level features, and we may not need to incorporate the low-level feature learning process into the whole classification pipeline.

³For example, the image for object categorization is partitioned in spatial-pyramid scales of 2^l × 2^l (for l = 0, 1, 2) [40]. The partitions differ across tasks; details are presented in the experiments.
⁴If we adopt spatial-pyramid scales of 2^l × 2^l (for l = 0, 1, 2) and a codebook of size 1000 for VQ, we concatenate 1 + 2² + 4² = 21 pooled features, each of size 1000, so the total mid-level feature dimension is 21,000.
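For completeness, here is a minimal sketch of the mid-level feature generation stage described earlier in this section (hard VQ, spatial max-pooling of the codes, concatenation, random projection, and unit-length normalization). The pyramid cells and the Gaussian projection matrix used here are illustrative assumptions.

```python
import numpy as np

def midlevel_feature(descriptors, codebook, proj, pyramid=(1, 2, 4)):
    """Mid-level feature sketch: hard VQ over a k-means codebook, max-pooling of
    the codes in spatial-pyramid cells, concatenation, then random projection
    followed by L2 normalization."""
    H, W, D = descriptors.shape
    K = codebook.shape[0]
    # Hard vector quantization: every location activates its nearest codeword.
    flat = descriptors.reshape(-1, D)
    dists = (flat ** 2).sum(1, keepdims=True) - 2 * flat @ codebook.T + (codebook ** 2).sum(1)
    codes = np.zeros((flat.shape[0], K))
    codes[np.arange(flat.shape[0]), dists.argmin(axis=1)] = 1.0
    codes = codes.reshape(H, W, K)
    # Max-pool the codes inside each cell of every pyramid level, then concatenate.
    pooled = []
    for cells in pyramid:
        for r in range(cells):
            for c in range(cells):
                cell = codes[r * H // cells:(r + 1) * H // cells,
                             c * W // cells:(c + 1) * W // cells, :]
                pooled.append(cell.max(axis=(0, 1)))
    feature = np.concatenate(pooled)          # e.g. (1 + 4 + 16) x 1000 = 21000-D
    # Learning-free random projection and unit-length normalization.
    reduced = proj @ feature
    return reduced / (np.linalg.norm(reduced) + 1e-8)

# Toy usage: 144-D descriptors, a 1000-word codebook, a 3000 x 21000 Gaussian projection.
codebook = np.random.randn(1000, 144)
proj = np.random.randn(3000, 21 * 1000) / np.sqrt(3000)
feature = midlevel_feature(np.random.rand(31, 23, 144), codebook, proj)
```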


IV. NEURON SELECTIVITY LAYER

As our mid-level features are generated in a purely unsupervised manner, for the sake of classification we propose to build an additional supervised layer over these features to boost the performance. This layer models the neuron selectivity principle in neuroscience [1], which states that certain neurons actively respond to images from a specific category while others stay unfired. We therefore call this layer the Neuron Selectivity (NS) layer, and feed its output into the classifier for classification. Let x_i ∈ R^p denote the mid-level feature of the i-th input image, which belongs to one of C classes. We would like to build an NS-layer with v neurons. The NS principle can then be mathematically modeled as a structured sparse learning problem.

A. Bottom-Up Inference and Top-Down Analysis

Given a specific mid-level feature x_i, the NS-layer neurons selectively respond to x_i and generate a set of activations h_i ∈ R^v. We use a bottom-up encoder function f_{W,b}(x_i) to derive the activations h_i, where the filter W ∈ R^{v×p} and b ∈ R^v. In this paper, we use the sigmoid function to generate element-wise activations:
$$\mathbf{h}_i = f_{\mathbf{W},\mathbf{b}}(\mathbf{x}_i) \equiv \left(1 + \exp(-(\mathbf{W}\mathbf{x}_i + \mathbf{b}))\right)^{-1}. \tag{1}$$
We also borrow a top-down feedback analysis (decoder), inspired by neuroscience [10] and by successful applications in computer vision [22], [42]. Since this top-down analysis explicitly reconstructs the mid-level feature of the image through a function of the activations, more reliable and discriminative activations can be derived. In this paper, we choose a simple linear decoder:
$$\mathbf{x}_i \approx \mathbf{D}\mathbf{h}_i, \tag{2}$$
where D = [d_1, ..., d_v] ∈ R^{p×v} is the weight matrix of the linear decoder. We now unify the top-down analysis and bottom-up inference into one formulation with appropriate constraints on D and H:
$$\min_{\mathbf{D},\mathbf{H},\mathbf{W},\mathbf{b}} \ \|\mathbf{X} - \mathbf{D}\mathbf{H}\|_F^2 + \alpha \|\mathbf{H} - f_{\mathbf{W},\mathbf{b}}(\mathbf{X})\|_F^2, \quad \text{s.t. constraints on } \mathbf{D} \text{ and } \mathbf{H}, \tag{3}$$
where X = [x_1, ..., x_N] ∈ R^{p×N} stacks all N training data in one matrix, H = [h_1, ..., h_N] ∈ R^{v×N} contains the corresponding activations, and α balances the two processes. Note that the input mid-level features are normalized and the encoder function is bounded in the range [0, 1]. Therefore, without loss of generality, viewing the decoder as a linear combination of the bases in D that reconstructs the mid-level features, we constrain the columns of D to have unit Euclidean length, i.e. ‖d_i‖_2^2 = 1, ∀i = 1, ..., v.
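As a minimal numerical sketch of Eqs. (1)-(3), the snippet below evaluates the bottom-up encoder and the unconstrained part of the joint objective; the toy shapes and the parameter value here are placeholders, not the settings used in the experiments.

```python
import numpy as np

def encoder(X, W, b):
    """Bottom-up inference, Eq. (1): element-wise sigmoid of W x + b, column-wise."""
    return 1.0 / (1.0 + np.exp(-(W @ X + b[:, None])))

def objective_eq3(X, D, H, W, b, alpha):
    """Unconstrained part of Eq. (3): top-down reconstruction (Eq. (2)) plus the
    encoder-fit term; columns of D are assumed already normalized to unit length."""
    recon = np.linalg.norm(X - D @ H, 'fro') ** 2            # ||X - DH||_F^2
    fit = np.linalg.norm(H - encoder(X, W, b), 'fro') ** 2   # ||H - f_{W,b}(X)||_F^2
    return recon + alpha * fit

# Toy shapes: p-dim mid-level features, v neurons, N samples (all placeholders).
p, v, N = 300, 50, 200
X = np.random.rand(p, N)
D = np.random.rand(p, v)
D /= np.linalg.norm(D, axis=0, keepdims=True)
W, b, H = np.random.randn(v, p), np.zeros(v), np.random.rand(v, N)
print(objective_eq3(X, D, H, W, b, alpha=1.0))
```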



Note that this layer is built to model the neuron selectivity principle, i.e. a particular set of neurons should actively respond to images from a specific class while others stay unfired. This means the activations H should have category-specific patterns for different class labels. To this end, we enforce a class-specific constraint on H to reflect the structural patterns and discriminativeness of the activations. In other words, this property can be modeled as a structured sparse learning problem. However, instead of explicitly allocating the neurons to each class [22], [24], [41], we implicitly model this property by imposing an ℓ_{2,1}-norm to adaptively zero out rows in the submatrices of H = [H_1, ..., H_c, ..., H_C]. Concretely, for all the N_c features of the c-th category, H_c = [H_c^{(1)}; ...; H_c^{(v)}] ∈ R^{v×N_c}, we impose
$$\|\mathbf{H}_c\|_{2,1} = \sum_{j=1}^{v} \|\mathbf{H}_c^{(j)}\|_2.$$
Besides, the activations from the same class should resemble each other, while those from different classes ought to be as different as possible. To this end, for the activations of the same class, say the c-th class denoted by H_c, we force them to be similar by minimizing ‖H_c − H̄_c‖_F^2, where H̄_c is the mean matrix of the c-th class (taking the mean vector of the activations H_c as its columns). At the same time, to differentiate the activations of different classes [24], [32], we drive the activations to be independent at the class level by minimizing Σ_{c=1}^{C} ‖H_c^T H_{/c}‖_F^2, where H_{/c} = [H_1, ..., H_{c−1}, H_{c+1}, ..., H_C]. Taking the constraints on H as a Lagrangian term, we have:
$$\phi(\mathbf{H}) = \sum_{c=1}^{C} \left( \lambda \|\mathbf{H}_c\|_{2,1} + \beta \|\mathbf{H}_c - \bar{\mathbf{H}}_c\|_F^2 + \gamma \|\mathbf{H}_c^T \mathbf{H}_{/c}\|_F^2 \right), \tag{4}$$
where the parameters λ, β and γ control each penalty term. The constraint defined by φ(H) has several implications. With the help of the first term, a sufficiently large γ drives the third term to zero; the neurons then automatically break into C separate parts, each corresponding to only one specific class. When β is large enough, the neurons respond to stimuli from the same class in an identical way, meaning the second term pushes the mechanism toward a strong classifier. Instead of enforcing the penalty so rigorously, we set the three parameters in φ(H) to moderate values, which allows (1) the intra-class variance to be preserved to prevent overfitting during training, and (2) a few neurons to be shared across categories so that the combination of fired neurons ensures both discrimination and compactness [24]. We now arrive at our final objective function with the Lagrangian term φ(H):
$$\min_{\mathbf{D},\mathbf{H},\mathbf{W},\mathbf{b}} \ \|\mathbf{X} - \mathbf{D}\mathbf{H}\|_F^2 + \alpha \|\mathbf{H} - f_{\mathbf{W},\mathbf{b}}(\mathbf{X})\|_F^2 + \phi(\mathbf{H}) \quad \text{s.t. } \|\mathbf{d}_i\|_2^2 = 1,\ \forall i = 1, \dots, v. \tag{5}$$
Each variable in Eq. 5 can be alternately optimized through a gradient descent method by fixing the others. The detailed derivation and optimization are given in the appendix.

B. Discussion

Mathematically, the proposed NS-layer formulated by Eq. 5 can be seen as a fast inference scheme for (structured) sparse coding [21], [23] and as a regularized linear autoencoder. However, our decoder term with its constraints can be seen as discriminative dictionary learning [22], [24], [41], which improves the classification performance in an analytical manner.⁵ Moreover, since we use the joint sigmoid encoder to quickly approximate the sparse codes, all the elements of the codes are non-negative. This is a desirable property that models the intuition of combining parts to form a whole, in contrast to classic sparse coding, which includes negative elements that cancel each other out [20].

⁵Analytical manner means the reconstruction of the input signal can be considered as an explicit function of the sparse codes.
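The structured penalty of Eq. (4) is straightforward to evaluate; the sketch below does so for a toy activation matrix, using the λ, β, γ values reported later in Section V-E purely as an example.

```python
import numpy as np

def phi(H, labels, lam, beta, gamma):
    """Structured penalty of Eq. (4): class-wise l_{2,1} row sparsity, similarity to
    the class-mean activation, and cross-class incoherence ||H_c^T H_{/c}||_F^2."""
    value = 0.0
    for c in np.unique(labels):
        Hc = H[:, labels == c]                    # activations of class c
        H_rest = H[:, labels != c]                # activations of all other classes
        l21 = np.linalg.norm(Hc, axis=1).sum()    # sum of row-wise l2 norms
        Hc_bar = np.repeat(Hc.mean(axis=1, keepdims=True), Hc.shape[1], axis=1)
        intra = np.linalg.norm(Hc - Hc_bar, 'fro') ** 2
        inter = np.linalg.norm(Hc.T @ H_rest, 'fro') ** 2
        value += lam * l21 + beta * intra + gamma * inter
    return value

# Toy example: 50 neurons, 3 classes with 20 samples each.
H = np.random.rand(50, 60)
labels = np.repeat(np.arange(3), 20)
print(phi(H, labels, lam=4.0, beta=2.0, gamma=2.0))
```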

V. EXPERIMENTS

In this section, we run extensive experiments to evaluate the proposed MidFea and NS-layer in terms of feature learning and image classification performance. First, we study the classification accuracy gains of our MidFea and NS-layer in a controlled way, and demonstrate that both a carefully learned mid-level feature and a sophisticated mechanism at a higher level boost the performance. Then, we carry out classification comparisons on four datasets for different tasks to show the effectiveness of our model. Moreover, we highlight the efficiency of our framework through the inference time for object categorization. Finally, we discuss some important parameters of our model.

In the classification experiments, we use a subset of the AR database [30] for face recognition and gender classification with standard experimental configurations [41]. The AR database consists of 50 male and 50 female subjects, and each subject has 14 images (resized to 66 × 48) captured in two sessions with illumination and expression changes. For face recognition, the first 7 images from Session 1 and the first 7 images from Session 2 of each person are used for training and testing, respectively; for gender classification, the first 25 male and the first 25 female individuals are used for training and the rest for testing. We also test our framework on age estimation over the FG-NET database [14], with images (resized to 60 × 60) spanning ages from 0 to 69. Consistent with the literature, we use the leave-one-person-out setting for evaluation, and the criterion is the Mean Absolute Error (MAE) [26].⁶ Finally, we evaluate our framework on object categorization over Caltech101 [13] and Caltech256 [17]. Throughout the experiments, we use the linear SVM toolbox [5] as the classifier, and choose the same settings to learn the low-level features, i.e. adaptively learning 9 filters on each database for soft convolution, each of size 7 × 7. The dimensions for random projection are determined empirically such that no noticeable drop in accuracy appears for any method. The partitions for spatial pooling differ across tasks; we specify them in each experiment. Moreover, we report classification accuracy for face recognition, gender classification and object categorization, and MAE for age estimation.

A. Accuracy Gains by MidFea and NS-Layer

To demonstrate the superiority of our MidFea and the NS-layer, we compare our model in a controlled way with

⁶MAE quantitatively measures how close predictions are to the real values: if f_i is the real value and y_i the predicted value, MAE = (1/n) Σ_{i=1}^{n} |f_i − y_i|.



TABLE I
ACCURACIES (%) OF FACE RECOGNITION AND GENDER CLASSIFICATION ON THE AR DATABASE. WE RERUN THESE METHODS UNDER OUR EXPERIMENT SETUP FOR FAIR COMPARISON AND REPORT THEIR RESULTS, SO THE RESULTS ARE DIFFERENT FROM THOSE IN THEIR ORIGINAL PAPERS

Fig. 3. Demonstration of the accuracy gains of the proposed MidFea and NS-layer by comparison with self-taught learning (ST), which can be seen as a three-layer network with the sparse codes as the mid-level features. The x-axis indicates the amount of unlabeled data used for learning bases in ST. The number in brackets shows the dimension of the mid-level features in ST.

the self-taught (ST) learning method [31], which can be seen as a three-layer network with the sparse codes as the mid-level features. Face recognition over the AR database is used for the comparison, and tens of thousands of face images (aligned and rescaled) downloaded from the internet are used for unsupervised feature learning. For ST, we vary the number of dictionary bases from 200 to 1200 and the number⁷ of unlabeled face images from 0 up to 30,000. We record in Fig. 3 the classification accuracies, as well as that obtained by a linear SVM on the raw image. From the figure we can see that, consistent with the literature, more neurons (bases) lead to better performance, and more unlabeled data yields a more reliable dictionary for ST. But once sufficient unlabeled data are available to learn a dictionary with a certain number of bases, the accuracy eventually saturates. Please note that the fluctuation of the curves is due to randomness in selecting the training data. However, when we add our NS-layer over the mid-level features produced by ST, a notable gain is obtained. This illustrates the accuracy gains the NS-layer brings at a higher level even with simple sparse codes as the mid-level features. Moreover, when the features are generated by our MidFea (with 500 codewords for VQ and a single layer of 3 × 3 partitions for spatial pooling), much better performance is achieved. This shows the advantage of our learned mid-level features over the simple sparse codes. Unsurprisingly, once the NS-layer is further built over our MidFea (MidFea-NS), the best performance is recorded. This experiment demonstrates that both a well-learned feature (MidFea) and a well-designed mechanism (NS-layer) at a higher level can boost the performance.

B. Facial Attributes Recognition

We now evaluate our model on facial attributes recognition: face recognition and gender classification on the AR database, and age estimation on the FG-NET database. A linear SVM on the raw image acts as the baseline (SVM-raw).

⁷0 means no unlabeled data is available; in this case, a random matrix is used as the dictionary.

Fig. 4. Left panel: original images from AR dataset and the corresponding neurons at NS-layer for face recognition. Right panel: the neurons learned for gender classification.

For a fair comparison on the AR database, we choose several state-of-the-art methods whose source code is available online, including SRC [38], FDDL [41], LC-KSVD [22] and LLC [37]. We use the random face [38] (300 dimensions) as the input for the first three methods to reproduce their results. For LLC, its codes are reduced to 300 dimensions with a random projection before being fed into the linear SVM. For face recognition and gender classification, our MidFea learns mid-level features with 500 codewords for VQ and a single layer of 3 × 3 partitions for spatial pooling. Moreover, our NS-layer learns 300 and 10 neurons for the two tasks, respectively. Detailed comparisons are listed in Table I, and some learned neurons for the two tasks are displayed in Fig. 4.⁸ From Table I, we can see that with the proposed MidFea and NS-layer, our performance exceeds that of the compared methods. Furthermore, as shown in Fig. 4, even though the two tasks share the same database, the neurons learned through the NS-layer capture characteristics specific to each task. This intuitively demonstrates why the proposed NS-layer works for classification.

Additionally, we use the FG-NET database for age estimation. Several state-of-the-art methods are compared here, including AGES [14], RUN [39], OHRank [6], MTWGP [44], BIF [18], and the recent CA-SVR [7]. Except for AGES, all the methods use the images with the Active Appearance Model [9]. We generate a 500-word codebook for VQ in MidFea, and define the

⁸To display the neurons, hereafter we use PCA for dimensionality reduction rather than random projection. Moreover, for the sake of demonstration, spatial pooling is waived here, and the neurons are averaged w.r.t. one image and then projected back to the input space.



TABLE II
MAE OF AGE ESTIMATION ON THE FG-NET DATABASE

TABLE III
ACCURACIES AND INFERENCE TIME OVER CALTECH101. FOR TIMING COMPARISON, THE MID-LEVEL FEATURES OF ALL METHODS ARE REDUCED TO 3000 DIMENSIONS BY RANDOM PROJECTION. WE OBSERVE THIS REDUCTION DOES NOT AFFECT THE ACCURACIES FOR THESE METHODS. THE TIME IN BRACKETS IS ACHIEVED BY THE IMPROVED SIFT EXTRACTION METHOD IN [27]

Fig. 5. Neurons learned on FG-NET for age estimation. The response values reveal that different ages are indeed associated with specific neurons, and we can see the neurons selectively respond to the facial textures of older people, as they have more wrinkles.

partition as a single layer of 8 × 8-pixel overlapping grids for spatial pooling. The results listed in Table II show that our model performs slightly better than the best previously reported performance from the recent CA-SVR, which is carefully designed to deal with the imbalanced-data problem, e.g. there are very few images of people aged 60 and above. OHRank also handles sparse data, but it is very slow, as shown in [7]. The BIF method resembles ours in that it uses the (hand-designed) biologically-inspired feature [33] for the mid-level features, which, however, is generated in a shallower architecture. The neurons shown in Fig. 5 suggest that our model captures age information through the wrinkles on the face.

C. Object Categorization

We now evaluate our model on Caltech101 and Caltech256 for object categorization. For both databases, we randomly select 30 images of each category for training and the rest for testing, and each image is resized to at most 150 × 150 pixels with the aspect ratio preserved. The compared methods include both recent unsupervised feature learning methods and well-known methods with hand-crafted SIFT features. The former include CDBN [28], adaptive DN [42], and PSD [21]; the latter include KSPM [27], ScSPM [40] and LLC [37]. For all the methods, mid-level features are generated with a 1000-word codebook for VQ or sparse coding, then reduced to 3000 dimensions by random projection. In our model, the NS-layer learns 2040 and 5120 neurons in total for the two databases respectively, assuming an average of 20 neurons associated with each category. The classic 3-layer pyramid partition (l = 0, 1, 2) is used for pooling. Detailed comparisons are listed in Table III and Table IV for Caltech101 and Caltech256, respectively. It is easy to see that our method outperforms those with the SIFT descriptor. Most importantly, the inference speed of our model is faster than the compared ones by an order

TABLE IV
ACCURACIES OF OBJECT CATEGORIZATION ON CALTECH256. NOTE THAT WE RERUN LLC ON THIS DATABASE UNDER THE CURRENT EXPERIMENT CONFIGURATION, SO THE RESULT IS DIFFERENT FROM THAT IN ITS ORIGINAL PAPER

of magnitude. In fact, we can stack the SIFT descriptor of every possible patch in an image as 128 feature maps, as in ScSPM. Therefore, we can compare the feature maps of ScSPM and ours to see the superiority of our model intuitively. Fig. 6 (a) displays the learned low-level feature extractors, and (c) shows some feature maps (full feature maps are in the supplementary material) of the four images in (b). Furthermore, we average all the feature maps and show the averaged map in the last column of panel (c). It is easy to see that the SIFT features incorporate more cluttered background, while ours focus more on the object and largely discard the noisy regions. We attribute this to the proposed soft convolution step. We also add the NS-layer to the CNN [25] on Caltech101, and achieve 90.83% ± 0.36 accuracy. This is about 5% higher than the 85.75% ± 0.52 obtained by the CNN alone.⁹ Admittedly, better performances are reported in the literature [3], [12], [16], [34]. But these methods turn to SIFT features (Microfeature) [3], [16], discriminative dictionary

⁹To conduct classification on Caltech101 with a CNN, we use the DeCAF toolbox [12], which is publicly available. We try the outputs of different layers and use the linear SVM as the classifier. Consistent with the results in [12], the output of the sixth layer leads to the best result.



Fig. 6. Visual comparison of the local descriptor feature maps of our model and SIFT (as adopted in ScSPM [40]) for four images from Caltech101. The nine filters learned on this database are presented in (a). (b) shows the four original images, whose feature maps generated by our model (upper row) and SIFT (bottom row) are presented in panel (c). Note that the last image in each row is the average of all feature maps. From the averaged map, we can see the SIFT maps distribute attention uniformly over the image, while ours mainly focus on the object.

TABLE V
COMPARISON OF DETAILED INFERENCE TIME (s). VQ, SC AND SP STAND FOR VECTOR QUANTIZATION, SPARSE CODING AND SPATIAL PYRAMID, RESPECTIVELY. FOR FAIR COMPARISON, THE MID-LEVEL FEATURES OF ALL THE METHODS ARE REDUCED TO 3000 DIMENSIONS BY RANDOM PROJECTION

learning for sparse coding [3], [16], an intersection kernel [3], a four times larger codebook for sparse coding [34], or training features with large-scale auxiliary data like ImageNet within a much deeper architecture [12]. In contrast, our model achieves results comparable to these sparse coding based feature learning methods by simply learning the features with far fewer parameters.

D. Inference Timing on Object Categorization

To highlight the efficiency of our framework, we study the inference time for a 150 × 150-pixel image in MATLAB on a PC with a dual-core 2.50 GHz CPU, a 32-bit OS and 2 GB RAM. We record in Table V the time spent in each step of our framework, including soft-threshold convolution, 3D pooling, the local feature descriptor, VQ, spatial pyramid (SP) pooling, random projection and inference with the classifier. As our MidFea generates mid-level features in a bottom-up manner, it costs much less time than top-down methods such as adaptive DN [42]. Specifically, adaptive DN needs

more than 1 minute to produce all the feature maps and 2 more seconds for the VQ and kernel classifier. This is slower than ours by almost two orders of magnitude (see Table V), as it involves multiple iterations to decompose the image into multi-layer feature maps. Therefore, we focus on comparing our model with three feed-forward methods: Kernel SPM (KSPM) [27], ScSPM [40] and LLC [37]. Table V summarizes the detailed comparisons, which show that our method runs faster than the compared ones by more than one order of magnitude. The three methods extract SIFT descriptors at a one-pixel step size, the densest possible setting, requiring more than 19 seconds to extract SIFT features from a 150 × 150-pixel image. Even using the fast SIFT extraction approach [27], they are still much slower than our method. Furthermore, ScSPM and LLC adopt sparse coding and locality-constrained coding, hence more running time is required. It is interesting that LLC runs much more slowly: LLC has to perform KNN for each SIFT feature, and when we extract SIFT features in the densest way on a 300 × 300-pixel image, we get 81,225 SIFT features, so performing KNN for all of them is very time-consuming. ScSPM, in contrast, can perform sparse coding in a batch manner, so it is faster than LLC. Note that Convolutional Neural Network (CNN) based methods [25], such as CDBN [28], are also fast as they likewise involve a feed-forward process at deeper layers, but ours achieves much higher classification accuracy than CDBN, as demonstrated in the experiments above. Considering that the main steps of our model are amenable to parallelization and GPU-based implementation, we expect it to find real-world applications.

E. Parameter Discussion

We first discuss the crucial parameters in our model (Eq. 5), including α, β, γ and λ. Moreover, the number of neurons in the NS-layer is also studied. Fig. 7(a) presents the curve



Fig. 7. (a) The choices of α, β, γ and λ vs. accuracy over the AR database for face recognition; (b) the number of neurons in the NS-layer vs. accuracy over the AR database for gender classification; (c) the reduced dimension by random projection vs. accuracy over the Caltech101 database (the red line is the accuracy over the original 21000-D features without reduction).

of accuracy vs. each parameter on the AR database for face recognition. We tuned all the parameters jointly, i.e. using four nested loops and tuning each parameter within an interval in each loop. To analyze each parameter, we fix the others to the values that achieve the best classification performance (α = 1.5, β = 2, γ = 2, λ = 4). It is easy to see that the classification accuracy is not sensitive to these parameters and remains stable over a large range. These results also indicate that the terms in the objective function indeed bring performance gains. Furthermore, we show accuracy vs. the number of neurons in the NS-layer for gender classification in Fig. 7(b), as the data is sufficient for this task (with the same setting as the gender classification experiment). We can see the accuracy peaks with a small number of neurons, say 6. This demonstrates the effectiveness of the NS-layer as a high-level stage serving classification. Furthermore, we plot in Fig. 7(c) the curve of accuracy vs. the reduced dimension by random projection on the Caltech101 database. The results are obtained by feeding in the mid-level features from our MidFea and discarding the NS-layer. When running a linear SVM over the original 21000-D features, we get 74.6% classification accuracy; when we use random projection to remove 85% of the dimensions, i.e. reducing the feature dimension from 21000 to 3150 = (1 − 0.85) × 21000, we get 73.9% classification accuracy, i.e. a drop of less than 1% compared with the original features. This demonstrates the effectiveness of random projection for dimensionality reduction. In fact, although random projection incurs a slight accuracy drop, it benefits the subsequent NS-layer, which not only compensates for the loss of accuracy but also improves the performance.

VI. CONCLUSION

By revisiting current feature learning frameworks, we present a simple approach to learn mid-level features unsupervisedly and in a feed-forward manner. Through comparison with hand-crafted features, we explain why the proposed MidFea produces desirable features from low level to mid level. Moreover, to boost classification performance, we propose to model the neuron selectivity principle by

building a supervised layer at a higher level for classification. This NS-layer supports both top-down analysis and fast bottom-up inference. As a result, our model runs faster by an order of magnitude than other methods based on sparse coding, and achieves comparable or even state-of-the-art experimental results. Based on our analysis and the experimental results, we argue that not only does a carefully learned feature (MidFea) bring improved accuracy, but a sophisticated mechanism (NS-layer) at a higher level also boosts the performance further. A future direction is to build a deeper architecture based on our MidFea for possible improvement. Moreover, as the proposed NS-layer is a general layer that can be applied to other competing methods, it is worth investigating the performance gained by adding it to other methods.

APPENDIX
OPTIMIZATION AT THE NEURON SELECTIVITY LAYER

For presentational convenience, we write the proposed objective function of the Neuron Selectivity (NS) layer as below:
$$\min_{\mathbf{D},\mathbf{H},\mathbf{W},\mathbf{b}} \ \|\mathbf{X} - \mathbf{D}\mathbf{H}\|_F^2 + \alpha \|\mathbf{H} - f_{\mathbf{W},\mathbf{b}}(\mathbf{X})\|_F^2 + \sum_{c=1}^{C} \left( \lambda \|\mathbf{H}_c\|_{2,1} + \beta \|\mathbf{H}_c - \bar{\mathbf{H}}_c\|_F^2 + \gamma \|\mathbf{H}_c^T \mathbf{H}_{/c}\|_F^2 \right) \quad \text{s.t. } \|\mathbf{d}_i\|_2^2 = 1,\ \forall i = 1, \dots, v. \tag{6}$$

Each variable in the above objective function can be alternately optimized, fixing the others, through stochastic gradient descent (SGD).

Updating D: Specifically, we apply SGD to update D while fixing the others, and its gradient is easily computed as:
$$\nabla_{\mathbf{D}} = -2\mathbf{X}\mathbf{H}^T + 2\mathbf{D}\mathbf{H}\mathbf{H}^T. \tag{7}$$
Alternatively, when the number of bases in D is not prohibitively large, we can update D analytically as D = XH^T(HH^T)^{-1}. Then, we normalize each column of D to unit Euclidean length. Note that this step does not have any negative effect on the overall performance, as the newly updated H, on which D solely depends, can adaptively absorb this scaling change.
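A sketch of this dictionary update is given below; the small ridge term added before the inversion is our own safeguard for numerical stability and is not part of the formulation.

```python
import numpy as np

def update_D(X, H, eps=1e-8):
    """Closed-form dictionary update D = X H^T (H H^T)^{-1} (the SGD alternative
    would follow the gradient in Eq. (7)), then normalize columns to unit length."""
    D = X @ H.T @ np.linalg.inv(H @ H.T + eps * np.eye(H.shape[0]))
    D /= (np.linalg.norm(D, axis=0, keepdims=True) + eps)   # enforce ||d_i||_2 = 1
    return D
```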


Updating Hc Class by Class: Prior to optimizing H_c, the responses of the NS-layer to the data from the c-th class, we calculate the mean matrix H̄_c first. Then, we update H_c as:


Algorithm 1 Algorithmic Summary at NS Layer

$$\mathbf{H}_c^{*} = \arg\min_{\mathbf{H}_c} \ \|\mathbf{X}_c - \mathbf{D}\mathbf{H}_c\|_F^2 + \alpha \|\mathbf{H}_c - f_{\mathbf{W},\mathbf{b}}(\mathbf{X}_c)\|_F^2 + \beta \|\bar{\mathbf{H}}_c - \mathbf{H}_c\|_F^2 + \gamma \|\mathbf{H}_{/c}^T \mathbf{H}_c\|_F^2 + \lambda \|\mathbf{H}_c\|_{2,1}, \tag{8}$$
where X_c stacks all data from the c-th class. Let G̃_c = [X_c; √α f_{W,b}(X_c); √β H̄_c; 0] ∈ R^{(p+2v+N−N_c)×N_c} and Q̃_c = [D; √α I; √β I; √γ H_{/c}^T] ∈ R^{(p+2v+N−N_c)×v}, in which 0 is a zero matrix of appropriate size. We rewrite the above function as:
$$g(\mathbf{H}_c) = \|\tilde{\mathbf{G}}_c - \tilde{\mathbf{Q}}_c \mathbf{H}_c\|_F^2 + \lambda \|\mathbf{H}_c\|_{2,1}. \tag{9}$$
We use SGD to optimize H_c, and the partial derivative of g w.r.t. H_c is:
$$\nabla_{\mathbf{H}_c} = -2\tilde{\mathbf{Q}}_c^T \tilde{\mathbf{G}}_c + 2\tilde{\mathbf{Q}}_c^T \tilde{\mathbf{Q}}_c \mathbf{H}_c + \lambda \mathbf{C} \mathbf{H}_c, \tag{10}$$
where C is a diagonal matrix with its i-th diagonal element given by
$$\mathbf{C}[i, i] = \frac{1}{\|\mathbf{H}_c^{(i)}\|_2}, \tag{11}$$
and H_c^{(i)} is the i-th row of H_c. Therefore, H_c can be optimized by alternating between Eq. 11 and Eq. 10 for a few iterations.

initialized H, both W and H can be then pre-trained for their initialization. However, the above initialization method lack flexibility, because the allocation of D to each category must be pre-defined by hand. Therefore, we can also simply initialized all the variables with non-negative random matrices, which serve the purpose of symmetry breaking. Empirically, we observe this random initialization does not mean inferior performance at all, but requires more time to converge. We owe it to that, even through our framework is a deep architecture, the NS-layer only incorporates one hidden layer, hence random initialization works quite well. The overall steps (with gradient descent method) of the NS-layer is summarized in Algorithm 1. ACKNOWLEDGEMENT The authors would like to thank all the anonymous reviewers for their constructive comments and suggestions. R EFERENCES [1] E. L. Bienenstock, L. N. Cooper, and P. W. Munro, “Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex,” J. Neurosci., vol. 2, no. 1, pp. 32–48, Jan. 1982. [2] O. Boiman, E. Shechtman, and M. Irani, “In defense of nearest-neighbor based image classification,” in Proc. IEEE Conf. CVPR, Jun. 2008, pp. 1–8. [3] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce, “Learning midlevel features for recognition,” in Proc. IEEE Conf. CVPR, Jun. 2010, pp. 2559–2566. [4] Y.-L. Boureau, J. Ponce, and Y. LeCun, “A theoretical analysis of feature pooling in visual recognition,” in Proc. 27th ICML, 2010, pp. 111–118. [5] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, p. 27, Apr. 2011. [6] K.-Y. Chang, C.-S. Chen, and Y.-P. Hung, “Ordinal hyperplanes ranker with cost sensitivities for age estimation,” in Proc. IEEE Conf. CVPR, Jun. 2011, pp. 585–592. [7] K. Chen, S. Gong, T. Xiang, and C. C. Loy, “Cumulative attribute space for age and crowd density estimation,” in Proc. IEEE Conf. CVPR, Jun. 2013, pp. 2467–2474. [8] A. Coates, H. Lee, and A. Y. Ng, “An analysis of single-layer networks in unsupervised feature learning,” in Proc. AISTATS, 2011, pp. 215–223. [9] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 681–685, Jun. 2001. [10] H. D. Critchley, “Neural mechanisms of autonomic, affective, and cognitive integration,” J. Comparative Neurol., vol. 493, no. 1, pp. 154–166, Dec. 2005. [11] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Comput. Soc. Conf. CVPR, Jun. 2005, pp. 886–893. [12] J. Donahue et al. (2013). “DeCAF: A deep convolutional activation feature for generic visual recognition.” [Online]. Available: http://arxiv.org/abs/1310.1531



[13] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” Comput. Vis. Image Understand., vol. 106, no. 1, pp. 59–70, Apr. 2007. [14] X. Geng, Z.-H. Zhou, and K. Smith-Miles, “Automatic age estimation based on facial aging patterns,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 12, pp. 2234–2240, Dec. 2007. [15] A. S. Georghiades, P. N. Belhumeur, and D. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 643–660, Jun. 2001. [16] H. Goh, N. Thome, M. Cord, and J.-H. Lim, “Unsupervised and supervised visual codes with restricted Boltzmann machines,” in Proc. ECCV, 2012, pp. 298–311. [17] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” California Inst. Technol., Pasadena, CA, USA, Tech. Rep., 2007. [18] G. Guo, G. Mu, Y. Fu, and T. S. Huang, “Human age estimation using bio-inspired features,” in Proc. IEEE Conf. CVPR, Jun. 2009, pp. 112–119. [19] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006. [20] P. O. Hoyer, “Non-negative matrix factorization with sparseness constraints,” J. Mach. Learn. Res., vol. 5, pp. 1457–1469, Dec. 2004. [21] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?” in Proc. 12th IEEE ICCV, Sep./Oct. 2009, pp. 2146–2153. [22] Z. Jiang, Z. Lin, and L. S. Davis, “Label consistent K-SVD: Learning a discriminative dictionary for recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2651–2664, Nov. 2013. [23] K. Kavukcuoglu, M. A. Ranzato, R. Fergus, and Y. Le-Cun, “Learning invariant features through topographic filter maps,” in Proc. IEEE Conf. CVPR, Jun. 2009, pp. 1605–1612. [24] S. Kong and D. Wang, “A dictionary learning approach for classification: Separating the particularity and the commonality,” in Proc. ECCV, 2012, pp. 186–199. [25] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. NIPS, 2012, pp. 1097–1105. [26] A. Lanitis, C. J. Taylor, and T. F. Cootes, “Toward automatic simulation of aging effects on face images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 442–455, Apr. 2002. [27] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. IEEE Comput. Soc. Conf. CVPR, Jun. 2006, pp. 2169–2178. [28] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proc. ICML, 2009, pp. 609–616. [29] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004. [30] A. M. Martinez and R. Benavente, “The AR face database,” CVC Tech. Rep. 24, Jun. 1998. [31] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, “Self-taught learning: Transfer learning from unlabeled data,” in Proc. ICML, 2007, pp. 759–766. [32] I. Ramirez, P. Sprechmann, and G. Sapiro, “Classification and clustering via dictionary learning with structured incoherence and shared features,” in Proc. IEEE Conf. CVPR, Jun. 2010, pp. 3501–3508. [33] M. Riesenhuber and T. 
Poggio, “Hierarchical models of object recognition in cortex,” Nature Neurosci., vol. 2, no. 11, pp. 1019–1025, 1999. [34] K. Sohn, D. Y. Jung, H. Lee, and A. O. Hero, “Efficient learning of sparse, distributed, convolutional feature representations for object recognition,” in Proc. IEEE ICCV, Nov. 2011, pp. 2643–2650. [35] S. S. Vempala, The Random Projection Method, vol. 65. Providence, RI, USA: AMS, 2004. [36] D. Wang, X. Wang, and S. Kong, “Integration of multi-feature fusion and dictionary learning for face recognition,” Image Vis. Comput., vol. 31, no. 12, pp. 895–904, Dec. 2013. [37] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Localityconstrained linear coding for image classification,” in Proc. IEEE Conf. CVPR, Jun. 2010, pp. 3360–3367. [38] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.

[39] S. Yan, H. Wang, X. Tang, and T. S. Huang, “Learning auto-structured regressor from uncertain nonnegative labels,” in Proc. IEEE ICCV, Oct. 2007, pp. 1–8. [40] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in Proc. IEEE Conf. CVPR, Jun. 2009, pp. 1794–1801. [41] M. Yang, L. Zhang, X. Feng, and D. Zhang, “Fisher discrimination dictionary learning for sparse representation,” in Proc. ICCV, Nov. 2011, pp. 543–550. [42] M. D. Zeiler, G. W. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” in Proc. IEEE ICCV, Nov. 2011, pp. 2018–2025. [43] H. Zhang, A. C. Berg, M. Maire, and J. Malik, “SVM-KNN: Discriminative nearest neighbor classification for visual category recognition,” in Proc. IEEE Comput. Soc. Conf. CVPR, Jun. 2006, pp. 2126–2136. [44] Y. Zhang and D.-Y. Yeung, “Multi-task warped Gaussian process for personalized age estimation,” in Proc. IEEE Conf. CVPR, Jun. 2010, pp. 2622–2629.

Shu Kong received the B.S. degree from Donghua University, in 2010, and the M.S. degree from Zhejiang University, in 2013. He is currently pursuing the Ph.D. degree with the University of California at Irvine. His research interests are in computer vision, pattern recognition and machine learning, and in applications to biological image analysis.

Zhuolin Jiang (M’12) received the Ph.D. degree in computer science from the South China University of Technology, in 2010. He was an Assistant Research Scientist with the Institute for Advanced Computer Studies, University of Maryland, from 2011 to 2012. He is currently a Researcher with the Noah’s Ark Laboratory. His research interests include computer vision, pattern recognition, and machine learning, specifically on sparse coding, dictionary learning, supervised learning, and submodular optimization.

Qiang Yang (F’09) received the Ph.D. degree from the Computer Science Department, University of Maryland, College Park, in 1989. From 1989 to 2001, he was a Faculty Member with the University of Waterloo and Simon Fraser University, Canada. He was the Founding Director of the Noah’s Ark Research Laboratory from 2012 to 2014. He is currently the Head of the Computer Science and Engineering Department with The Hong Kong University of Science and Technology, where he is a New Bright Endowed Chair Professor of Engineering. His research interests are data mining and artificial intelligence, including machine learning, planning, and case-based reasoning. He is a fellow of the Association for the Advancement of Artificial Intelligence, the International Association for Pattern Recognition, and the American Association for the Advancement of Science. He was elected as the Vice Chair of ACM SIGART/SIGAI in 2010, and is currently an Advisor of ACM SIGAI. He was the Founding Editor-in-Chief of the ACM Transactions on Intelligent Systems and Technology, and is the Founding Editor-in-Chief of the IEEE TRANSACTIONS ON BIG DATA. He is on the Editorial Board of the IEEE Intelligent Systems and several other international journals (the IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (2005–2009) and AI Magazine). He served as the PC Co-Chair and General Co-Chair of several international conferences, including ACM KDD 2010 and 2012, ACM RecSys 2013, ACM IUI 2010, and ICCBR 2001. He serves as an IJCAI Trustee, and will be the PC Chair of IJCAI 2015.
