IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 10, OCTOBER 2014


Salient Region Detection by Fusing Bottom-Up and Top-Down Features Extracted From a Single Image

Huawei Tian, Yuming Fang, Yao Zhao, Senior Member, IEEE, Weisi Lin, Senior Member, IEEE, Rongrong Ni, Member, IEEE, and Zhenfeng Zhu, Member, IEEE

Abstract—Recently, some global contrast-based salient region detection models have been proposed based only on the low-level feature of color. It is necessary to consider both color and orientation features to overcome their limitations, and thus improve the performance of salient region detection for images with low contrast in color and high contrast in orientation. In addition, the existing fusion methods for different feature maps, such as the simple averaging method and the selective method, are not sufficiently effective. To overcome these limitations of existing salient region detection models, we propose a novel salient region model based on the bottom-up and top-down mechanisms: the color contrast and orientation contrast are adopted to calculate the bottom-up feature maps, while the top-down cue of depth-from-focus from the same single image is used to guide the generation of the final salient regions, since depth-from-focus reflects the photographer's preference and knowledge of the task. A more general and effective fusion method is designed to combine the bottom-up feature maps. According to the degree-of-scattering and eccentricities of the feature maps, the proposed fusion method can assign adaptive weights to different feature maps to reflect the confidence level of each feature map. The depth-from-focus of the image, as a significant top-down feature for visual attention in the image, is used to guide the salient regions during the fusion process; with its aid, the proposed fusion method can filter out the background and highlight salient regions for the image. Experimental results show that the proposed model outperforms the state-of-the-art models on three publicly available data sets.

Index Terms—Human visual system (HVS), salient region detection, bottom-up and top-down visual attention.

Manuscript received July 31, 2013; revised January 9, 2014 and May 27, 2014; accepted August 2, 2014. Date of publication August 22, 2014; date of current version September 5, 2014. This work was supported in part by the 973 Program under Grant 2012CB316400, in part by the National Natural Science Funds for Distinguished Young Scholar under Grant 61025013, in part by the National Natural Science Foundation of China under Grant 61332012, Grant 61402484, and Grant 61272355, in part by the Program for Changjiang Scholars and Innovative Research Team in University under Grant IRT 201206, and in part by the Open Projects Program of the National Laboratory of Pattern Recognition under Grant 201306309. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Damon M. Chandler.

H. Tian is with the People's Public Security University of China, Beijing 100038, China (e-mail: [email protected]).

Y. Fang is with the School of Information Technology, Jiangxi University of Finance and Economics, Nanchang 330032, China (e-mail: [email protected]).

Y. Zhao, R. Ni, and Z. Zhu are with the Institute of Information Science, Beijing Jiaotong University, and also with the Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, China (e-mail: [email protected]; [email protected]; [email protected]).

W. Lin is with the School of Computer Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2014.2350914

I. INTRODUCTION

VISUAL attention is an important mechanism in the HVS. Recently, developing saliency detection models to simulate visual attention has aroused great interest in the research community. Saliency detection models can be used to address various challenging tasks in visual signal applications, such as image segmentation [1], object recognition [2], content-aware image editing [3]–[5], image retrieval [6]–[8], etc. More applications can be found in [9].

Treisman's highly influential Feature Integration Theory (FIT) [10] provides a two-stage architecture for human vision: the bottom-up, preattentive, data-driven stage and the top-down, goal-driven, attentive stage. In the first stage, there is a limited set of attributes (color, orientation, motion, etc.) that can be processed in parallel [11]–[13]. Wolfe et al. claimed that color, orientation, and motion are undoubted attributes that guide the deployment of attention [14]. The bottom-up feature maps are obtained based on the contrast of these attributes. Usually, the more a region differs from others, the more attention it will attract. The critical difference between the two stages is that the second stage requires serial "binding" steps [15]–[17]. Binding is the act of linking bits of information; it is strongly biased by top-down inputs and thus highlights salient regions in the bottom-up feature maps from the first stage [13], [15], [18], [19]. Based on the above description, the following steps should be involved in visual attention modeling: 1) the feature contrast calculation based on low-level features such as color, orientation, etc.; 2) the fusion process by binding various bottom-up feature maps; 3) the highlighting of salient targets with the aid of top-down features, if available.

The contrast from low-level features is a significant factor for saliency detection. Based on FIT, many contrast-based detection models have been proposed from the perspectives of local and global contrast. One of the earliest saliency detection models was proposed by Itti et al. [20]. In their work, the bottom-up saliency map is calculated from multi-scale center-surround differences. Ma et al. proposed a local contrast analysis for calculating the saliency map, and adopted a fuzzy growing algorithm to detect salient regions [21]. Liu et al. combined multi-scale contrast, a center-surround histogram, and color spatial distribution to learn a conditional random field for estimating salient regions [22]. Most early works [20], [21], [23], [24] were proposed based on local contrast. The resulting saliency maps of these local contrast based models are usually blurry, and overemphasize



Fig. 1. From left to right columns: the original images; the basic region-wise depth-from-focus map; the color feature map; the orientation feature map; the final saliency map by fusing color, orientation and depth-from-focus features; and the ground truth. All feature maps and saliency maps in the figure are generated by the method of Section II.

small and purely local regions near edges instead of uniform salient regions. To overcome the drawback of these models, some global contrast based salient region detection models have recently been reported in [25]–[27]. These global contrast based models perform well in highlighting uniform salient regions. However, they only use the low-level feature of color and neglect others such as the orientation feature. Therefore, their performance is poor for salient region detection in images with low contrast in color and high contrast in orientation, as shown in the first and third rows of Fig. 1. Obviously, a more comprehensive set of features including the contrast of orientation should be adopted.

In the second stage of existing visual attention models, the fusion methods for binding different feature maps are neither flexible nor effective enough. In [20], the bottom-up contrast based feature maps from the first stage are bound into the final saliency map by simple averaging. In [28], the final saliency map is selected as either the orientation feature map or the color feature map by identifying which map leads to more correct identification of salient regions. As illustrated in Fig. 1, the orientation or color feature map alone may lead to correct identification of salient regions (the first and third rows), or the two maps may be complementary (the second and fourth rows). When one feature map alone leads to correct identification of salient regions, the averaging based fusion method is not effective enough; when the feature maps are complementary, the selective fusion method is not effective. Therefore, a more general and effective method should be designed to obtain a better final saliency map, one that can assign adaptive weights to different feature maps to reflect the confidence level of each feature map. A strategy of combining individual feature maps based on a composite saliency indicator is proposed in [29]. In [30], a combination strategy of adaptive feature selection is designed based on cluster density.

Recently, top-down features have been explored for salient region detection. In [31], Torralba et al. adopted the context feature as top-down information. Fang et al. used the features of cars as top-down information in [32]. Borji [33] combined low-level features with top-down cognitive visual features (e.g., faces, humans, cars, etc.) and learned a direct mapping from those features to eye fixations using regression, SVM, and AdaBoost classifiers. In [34], top-down features including location, semantic, and color priors are integrated into the saliency


detection model. The models in [35] and [36] use spatial bias, feature bias, context, and task as top-down cues to build visual attention models. As described above, in the second stage of FIT, the final salient regions are strongly biased by top-down features. With top-down features (e.g., cars, people, faces, text, road signs, affective and emotional stimuli, or actions on scenes), the human brain will highlight some specific salient regions among those obtained from the bottom-up mechanism. Generally, it is difficult and time-consuming to learn these specific top-down features for a computational model. Moreover, modeling effective fusion for combining top-down features is also a challenge.

In this paper, we make use of the top-down feature of depth-from-focus from a single image to calculate the final salient regions for images. When photographers take a photo, they automatically place the targets they consider interesting or important in focus. Existing studies have shown that interesting objects are more salient than others within a scene [22], [24]–[27], [37], [38]. Additionally, it is well accepted that human experience or knowledge can be used as top-down information in visual attention modeling [9], [34], [39]. Therefore, the depth-from-focus of an image is a common, convenient, and significant top-down feature for visual attention within a single image. Furthermore, it is much easier to obtain this top-down feature of depth-from-focus than other existing ones, such as those learned for specific top-down cues (e.g., cars, faces, text, signs). To the best of our knowledge, the top-down feature of depth-from-focus has not been used in existing saliency detection models.

In essence, we propose in this paper a novel two-stage salient region detection model using the low-level features of orientation and color and the top-down feature of depth-from-focus extracted from a single image. The model can be used for salient region detection in pictures taken by human beings. For this kind of image, depth-from-focus can be considered top-down information. The framework of the proposed model is illustrated in Fig. 2. In the first, preattentive stage, two contrast-based feature maps are computed based on the uniqueness and spatial distributions of orientation and color, respectively. We first segment the image into basic regions for contrast calculation. Then the region uniqueness of orientation and color is computed based on the extracted orientation and color features, respectively. In the second stage, the orientation and color feature maps generated from the preattentive stage are fused to obtain the final saliency map with the guidance of depth-from-focus.

The remainder of this paper is organized as follows. Section II describes the proposed salient region detection model. In Section III, we evaluate the performance of the proposed model through comparison experiments with state-of-the-art methods. The final section concludes the paper.

II. THE PROPOSED MODEL

Firstly, we use the oversegmentation algorithm in [27] to pre-process the input image and obtain N basic regions before performing salient region detection. The algorithm is a modified version of SLIC superpixels [40]. SLIC superpixels segment an image using K-means clustering in RGBXY space.


Fig. 3. An example of computing LBP for a pixel in a 3 × 3 neighborhood and orientation descriptor of a basic region in an image.

Fig. 2. Framework of the proposed salient region detection model.

The algorithm of [27] slightly modifies the SLIC approach and uses K-means clustering with the geodesic image distance in CIELab space, which guarantees connectivity while retaining the locality, compactness, and edge awareness of SLIC superpixels. A sample result of oversegmentation is shown in Fig. 2 and Fig. 4(a). Based on these basic regions of oversegmentation, we combine the uniqueness measurement with the spatial distribution measurement of the orientation and color features to generate the orientation and color feature maps, respectively. Then, the proposed fusion method combines the orientation and color feature maps to get the final saliency map with the guidance of depth-from-focus. The framework of the proposed model is illustrated in Fig. 2, with the associated equations indicated for readers' convenience. Fig. 4 illustrates the main phases of our model.

Fig. 4. Illustration of the main phases of the proposed model. (a) Oversegmentation of the input image; (b) U^O: Orientation uniqueness; (c) D^O: Orientation distribution; (d) U^C: Color uniqueness; (e) D^C: Color distribution; (f) S^O: Orientation feature map; (g) S^C: Color feature map; (h) αS^O + βS^C: Fusion map from orientation and color feature maps; (i) FF: Fusion map from orientation feature map, color feature map, and depth-from-focus feature; (j) F: Depth-from-focus map.
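As an illustration of this pre-processing step, the sketch below segments an image into basic regions and collects the per-region mean CIELab color and normalized position used throughout Section II. It relies on the standard SLIC implementation in scikit-image as a stand-in for the modified, geodesic-distance SLIC of [27], so the resulting boundaries will differ from the published method; the region count and compactness values are assumptions.

```python
import numpy as np
from skimage import color
from skimage.segmentation import slic

def oversegment(image_rgb, n_regions=200):
    """Split the image into roughly n_regions basic regions (superpixels).

    Standard SLIC (K-means over CIELab + XY) is used here as a stand-in for
    the modified, geodesic-distance SLIC of [27]; labels start at 0.
    """
    labels = slic(image_rgb, n_segments=n_regions, compactness=10, start_label=0)
    lab = color.rgb2lab(image_rgb)
    h, w = labels.shape
    regions = []
    for r in np.unique(labels):
        mask = labels == r
        ys, xs = np.nonzero(mask)
        regions.append({
            "mean_lab": lab[mask].mean(axis=0),               # mean CIELab color c_i
            "pos": np.array([xs.mean() / w, ys.mean() / h]),  # normalized position p_i
            "mask": mask,
        })
    return labels, regions
```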

A. Color Feature Map

We compute the color uniqueness, color distribution, and color feature map based on the study in [27]. The color uniqueness of a basic region i is defined as a function of the region position p_i and the color feature c_i in CIELab color space:

U_i^C = \sum_{j=1}^{N} \| c_i - c_j \|^2 \cdot w(p_i, p_j),   (1)

where c_i is the mean color of all pixels in basic region i; the Gaussian weight w(p_i, p_j) is defined as w(p_i, p_j) = \exp(-\frac{1}{2\sigma_p^2} \| p_i - p_j \|^2). \sigma_p controls the range of the uniqueness operator. Examples of the color uniqueness calculation are shown in Fig. 2 and Fig. 4(d).

The color distribution of a basic region i is defined by the spatial variance D_i^C of its color feature c_i:

D_i^C = \sum_{j=1}^{N} \| p_j - \bar{p}_i \|^2 \cdot w(c_i, c_j),   (2)

where \bar{p}_i is the weighted mean position of color feature c_i, \bar{p}_i = \sum_{j=1}^{N} w(c_i, c_j) \cdot p_j; w(c_i, c_j) describes the similarity of the color features c_i and c_j of basic regions i and j, respectively; the Gaussian weight w(c_i, c_j) is defined as w(c_i, c_j) = \exp(-\frac{1}{2\sigma_c^2} \| c_i - c_j \|^2)/Z_i. The parameter \sigma_c determines the sensitivity of the color distribution of the basic region. Z_i is the normalization factor ensuring \sum_{j=1}^{N} w(c_i, c_j) = 1. Examples of the color distribution calculation are shown in Fig. 2 and Fig. 4(e).

Then, both U_i^C and D_i^C are normalized into the range [0, 1]. The saliency value S_i^C for each basic region is computed as:

S_i^C = \frac{U_i^C}{\exp(k_C \cdot D_i^C)},   (3)

where k_C is used to balance the proportion between the uniqueness and distribution measures. In [27], k_C = 3 is adopted as the scaling factor in experiments. Lastly, the color feature map S^C is normalized to the range [0, 1]. Examples of the color feature map are shown in Fig. 2 and Fig. 4(g).
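The following sketch implements the color uniqueness (1), color distribution (2), and color saliency (3) on the per-region colors and positions produced above. The values σ_p = 0.25 and k_C = 3 follow the text; σ_c is not specified here, so the value 20 is an assumed placeholder.

```python
import numpy as np

def color_feature_map(colors, positions, sigma_p=0.25, sigma_c=20.0, k_c=3.0):
    """Color uniqueness (1), distribution (2) and saliency (3) per basic region.

    colors: (N, 3) mean CIELab colors c_i; positions: (N, 2) normalized positions p_i.
    sigma_c = 20 is an assumed placeholder (the text fixes sigma_p = 0.25 and k_C = 3).
    """
    col_d2 = np.sum((colors[:, None, :] - colors[None, :, :]) ** 2, axis=-1)       # ||c_i - c_j||^2
    pos_d2 = np.sum((positions[:, None, :] - positions[None, :, :]) ** 2, axis=-1)  # ||p_i - p_j||^2

    w_p = np.exp(-pos_d2 / (2 * sigma_p ** 2))                  # spatial Gaussian weight w(p_i, p_j)
    uniqueness = (col_d2 * w_p).sum(axis=1)                     # Eq. (1)

    w_c = np.exp(-col_d2 / (2 * sigma_c ** 2))                  # color similarity weight w(c_i, c_j)
    w_c /= w_c.sum(axis=1, keepdims=True)                       # normalization Z_i (rows sum to 1)
    p_bar = w_c @ positions                                     # weighted mean position of c_i
    dist = np.sum((positions[None, :, :] - p_bar[:, None, :]) ** 2, axis=-1)  # ||p_j - p_bar_i||^2
    distribution = (dist * w_c).sum(axis=1)                     # Eq. (2)

    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-12)
    saliency = norm(uniqueness) / np.exp(k_c * norm(distribution))  # Eq. (3)
    return norm(saliency)
```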

B. Orientation Feature Map

Similarly, we calculate the orientation feature map based on the orientation uniqueness and orientation distribution. Here, we use a local binary pattern (LBP) [41] based orientation descriptor as the orientation feature of each basic region. The Chi-square distance instead of the Euclidean distance is used to compute the distance between two orientation descriptors.

1) Orientation Descriptor: An example of computing the LBP for a pixel in a 3 × 3 neighborhood is shown in Fig. 3. The LBP value of a pixel is calculated as follows:

\mathrm{LBP} = \sum_{i=1}^{P} t(v_i - v_c) \cdot 2^{i-1}, \quad t(x) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0, \end{cases}   (4)

where v_c is the intensity value of the central pixel, v_i is the intensity value of a pixel adjacent to the central pixel in the n × n neighborhood, P is n^2 - 1 for an n × n neighborhood, and the function t(·) is a threshold operation. We use n = 3 in this study. The orientation descriptor of a basic region is defined as the distribution of LBP values, over the three channels of the CIELab color space, of all pixels in the basic region (with different sensors and acquisition conditions the image format may differ; in this study, we assume the images are in standard RGB format). This distribution is defined as a histogram with 256 bins, as shown in Fig. 3.
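A minimal sketch of the LBP computation of (4) and the 256-bin region descriptor follows. How the three CIELab channels are combined into a single histogram is not fully specified in the text; pooling the LBP codes of all three channels into one 256-bin histogram is one reasonable reading and is an assumption here.

```python
import numpy as np

def lbp_3x3(channel):
    """LBP code of Eq. (4) for every interior pixel of one channel (3x3 neighborhood, P = 8)."""
    c = channel[1:-1, 1:-1]                                  # central pixels v_c
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]             # the 8 neighbors v_i
    code = np.zeros_like(c, dtype=np.int32)
    for k, (dy, dx) in enumerate(offsets):
        v = channel[1 + dy:channel.shape[0] - 1 + dy,
                    1 + dx:channel.shape[1] - 1 + dx]        # shifted neighbor plane
        code += ((v - c) >= 0).astype(np.int32) << k         # t(v_i - v_c) * 2^(i-1)
    return code

def orientation_descriptor(lab_image, region_mask):
    """256-bin LBP histogram of a basic region, pooled over the three CIELab channels.

    Pooling the three channels into one histogram is an assumption; the paper only
    states that the descriptor is the distribution of LBP values in the three channels.
    """
    inner = region_mask[1:-1, 1:-1]                          # LBP is defined on the cropped interior
    hist = np.zeros(256)
    for ch in range(3):
        codes = lbp_3x3(lab_image[:, :, ch])
        hist += np.bincount(codes[inner], minlength=256)
    return hist / max(hist.sum(), 1)                         # normalized histogram o_i
```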


2) Orientation Uniqueness: Orientation uniqueness is used to evaluate the orientation difference between each basic region and the others; it is defined as the orientation rarity of a basic region i compared with all other basic regions j:

U_i^O = \sum_{j=1}^{N} \chi(o_i, o_j) \cdot w(p_i, p_j),   (5)

where N is the number of basic regions; p_i is the position of basic region i, normalized into the range [0, 1]; o_i is the orientation descriptor of basic region i; \chi(o_i, o_j) is the Chi-square distance used to compute the distance between orientation descriptors o^{(i)} and o^{(j)}; the Gaussian weight w(p_i, p_j) is defined as w(p_i, p_j) = \exp(-\frac{1}{2\sigma_p^2} \| p_i - p_j \|^2). \sigma_p controls the range of the uniqueness operator; it is set to 0.25 in all experiments, as in [27], to obtain a balance between local and global effects.

In this study, we use the Chi-square distance instead of the Euclidean distance to compute the difference between two orientation descriptors. The Chi-square distance between orientation descriptors o^{(i)} and o^{(j)} is defined as:

\chi(o^{(i)}, o^{(j)}) = \sum_{b=1}^{B} (o_b^{(i)} - o_b^{(j)})^2 / (o_b^{(i)} + o_b^{(j)}),   (6)

where B is the number of bins, and o_b^{(i)} and o_b^{(j)} are the values of orientation descriptors o^{(i)} and o^{(j)} at the b-th bin, respectively. The Chi-square distance is a particular weighted Euclidean distance applicable to count data; it is calculated between the relative counts of each sample, called profiles [42]. The orientation feature based on the LBP histogram is exactly such a case. Therefore, we use the Chi-square instead of the Euclidean distance to compute the distance between two orientation descriptors. Examples of the orientation uniqueness calculation are shown in Fig. 2 and Fig. 4(b).
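The sketch below computes the Chi-square distance of (6) and the orientation uniqueness of (5) for all basic regions; the small epsilon guarding empty histogram bins is an implementation assumption.

```python
import numpy as np

def chi_square(hist_i, hist_j, eps=1e-12):
    """Chi-square distance of Eq. (6) between two LBP histograms."""
    num = (hist_i - hist_j) ** 2
    den = hist_i + hist_j + eps          # eps guards bins that are empty in both histograms
    return np.sum(num / den)

def orientation_uniqueness(descriptors, positions, sigma_p=0.25):
    """Orientation uniqueness of Eq. (5) for every basic region."""
    n = len(descriptors)
    chi = np.array([[chi_square(descriptors[i], descriptors[j]) for j in range(n)]
                    for i in range(n)])                               # pairwise chi-square distances
    pos_d2 = np.sum((positions[:, None, :] - positions[None, :, :]) ** 2, axis=-1)
    w_p = np.exp(-pos_d2 / (2 * sigma_p ** 2))                        # w(p_i, p_j)
    return (chi * w_p).sum(axis=1)                                    # Eq. (5)
```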

3) Orientation Distribution: The spatial orientation distribution is used to measure the degree of orientation dispersion. The spatial distribution measurement of a basic region i is defined by the spatial variance D_i^O of its orientation descriptor o_i; a higher spatial variance implies a higher degree of dispersion:

D_i^O = \sum_{j=1}^{N} \| p_j - \bar{p}_i \|^2 \cdot w(o_i, o_j),   (7)

where p_j is the position of basic region j; \bar{p}_i is the weighted mean position of orientation descriptor o_i, \bar{p}_i = \sum_{j=1}^{N} w(o_i, o_j) \cdot p_j; w(o_i, o_j) describes the similarity of o_i and o_j of basic regions i and j, respectively. The Gaussian weight w(o_i, o_j) is defined as w(o_i, o_j) = \exp(-\frac{1}{2\sigma_o^2} \chi(o_i, o_j))/Z_i. The parameter \sigma_o controls the orientation sensitivity of the distribution of the basic region. We use \sigma_o = 20 in all our experiments, as in [27], which allows for a better orientation sensitivity. Z_i is the normalization factor ensuring \sum_{j=1}^{N} w(o_i, o_j) = 1. Examples of the orientation distribution calculation are shown in Fig. 2 and Fig. 4(c).

4) Orientation Saliency Calculation: By combining U_i^O and D_i^O, we compute the saliency value S_i^O from the orientation feature for each basic region as:

S_i^O = \frac{U_i^O}{\exp(k_O \cdot D_i^O)},   (8)

where the positive k_O is used to balance the proportion between the uniqueness and distribution measures. The orientation feature map S^O is then normalized to the range [0, 1]. Examples of the orientation feature map are shown in Fig. 2 and Fig. 4(f).
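For completeness, the orientation distribution (7) and orientation saliency (8) can be sketched in the same way as the color case; the value of k_O is not stated in the text, so k_O = 3 (mirroring k_C) is an assumption.

```python
import numpy as np

def orientation_feature_map(descriptors, positions, chi, sigma_p=0.25, sigma_o=20.0, k_o=3.0):
    """Orientation distribution (7) and saliency (8) per basic region.

    chi: (N, N) pairwise chi-square distances between the descriptors.
    k_o is not specified in the text; 3.0 mirrors the color case and is an assumption.
    """
    pos_d2 = np.sum((positions[:, None, :] - positions[None, :, :]) ** 2, axis=-1)
    w_p = np.exp(-pos_d2 / (2 * sigma_p ** 2))
    uniqueness = (chi * w_p).sum(axis=1)                       # Eq. (5), reused here

    w_o = np.exp(-chi / (2 * sigma_o ** 2))                    # orientation similarity weight w(o_i, o_j)
    w_o /= w_o.sum(axis=1, keepdims=True)                      # normalization Z_i
    p_bar = w_o @ positions                                    # weighted mean position of o_i
    dist = np.sum((positions[None, :, :] - p_bar[:, None, :]) ** 2, axis=-1)
    distribution = (dist * w_o).sum(axis=1)                    # Eq. (7)

    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-12)
    return norm(norm(uniqueness) / np.exp(k_o * norm(distribution)))   # Eq. (8)
```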

C. Depth-From-Focus Map

In order to draw observers' attention, photographers usually place the object that is most interesting or important in their opinion in focus, while other redundant information is placed out of focus. Therefore, the depth-from-focus of a single image is a common and significant top-down feature. Out-of-focus blur in images occurs when objects are placed outside the focal range of the camera. Out-of-focus blur is usually modeled as Gaussian blurring with a standard deviation σ applied to a sharp image [43]. Many techniques have been proposed to address this problem [44]–[48]. Firstly, we estimate the blur amount of edges in the image. Then, the relative depth-from-focus of each basic region in the image is obtained by a Gaussian weighted combination of the blur amounts σ at edges in the original image. The detailed steps of the depth-from-focus map estimation are as follows.

Firstly, the image is re-blurred using a Gaussian kernel with standard deviation σ_0. Then, the ratio R between the gradients of the original image and the re-blurred image is computed at edges. As reported in [45], the blur amount σ of an edge in the image is calculated by the following formula:

\sigma = \frac{1}{\sqrt{R^2 - 1}} \sigma_0.   (9)

We use the Canny edge detector to detect edges in the image and set the standard deviation of the re-blurring to σ_0 = 1. We normalize all blur amounts of edges into the range [0, 1]. We define F_i, the relative depth-from-focus of basic region i, as a weighted combination of the blur amounts σ at edges (not all image pixels) in the original image:

F_i = \sum_{(x, y) \in I} (1 - \sigma(x, y)) \cdot w(x, y, x_{p_i}, y_{p_i}),   (10)

where (x, y) is the position of each edge pixel in image I; σ(x, y) is the blur amount of the edge pixel I(x, y); the Gaussian weight w(x, y, x_{p_i}, y_{p_i}) is set as w(x, y, x_{p_i}, y_{p_i}) = \exp(-\frac{1}{2\sigma_F^2} \| (x, y) - (x_{p_i}, y_{p_i}) \|^2), where σ_F controls the sensitivity to the distance between the basic region and the edges. A small σ_F tends to increase the effect of local blur amounts; a large σ_F tends to increase the effect of global blur amounts. The region-wise depth-from-focus map F is normalized to the range [0, 1]. Examples of the depth-from-focus map are shown in Fig. 1 and Fig. 4(j).
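A sketch of the depth-from-focus estimation of (9) and (10) follows, using the Canny detector and Gaussian re-blurring with σ_0 = 1 as described; the choice of σ_F and the simple gradient-magnitude ratio used for R are assumptions, and degenerate cases (e.g., images with no detected edges) are not handled.

```python
import numpy as np
from skimage import color, feature, filters

def depth_from_focus(image_rgb, regions, sigma0=1.0, sigma_f=0.25):
    """Region-wise depth-from-focus map of Eqs. (9)-(10).

    regions: list of dicts with a normalized 'pos' entry (as produced by the
    oversegmentation sketch above); sigma_f is the distance sensitivity of Eq. (10)
    and its value here is an assumption.
    """
    gray = color.rgb2gray(image_rgb)
    h, w = gray.shape

    # Gradient-magnitude ratio R between the original and the re-blurred image.
    reblurred = filters.gaussian(gray, sigma=sigma0)
    g_orig = np.hypot(*np.gradient(gray))
    g_blur = np.hypot(*np.gradient(reblurred)) + 1e-12
    ratio = g_orig / g_blur

    edges = feature.canny(gray)                                  # edge locations
    r = np.clip(ratio[edges], 1.0 + 1e-6, None)                  # R must exceed 1 for Eq. (9)
    sigma_edge = sigma0 / np.sqrt(r ** 2 - 1)                    # blur amount at edges, Eq. (9)
    sigma_edge = (sigma_edge - sigma_edge.min()) / (np.ptp(sigma_edge) + 1e-12)

    ys, xs = np.nonzero(edges)
    edge_pos = np.stack([xs / w, ys / h], axis=1)                # normalized edge positions
    sharpness = 1.0 - sigma_edge                                 # in-focus edges score high

    F = np.empty(len(regions))
    for i, reg in enumerate(regions):
        d2 = np.sum((edge_pos - reg["pos"]) ** 2, axis=1)
        wgt = np.exp(-d2 / (2 * sigma_f ** 2))                   # w(x, y, x_pi, y_pi)
        F[i] = np.sum(sharpness * wgt)                           # Eq. (10)
    return (F - F.min()) / (np.ptp(F) + 1e-12)
```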


Fig. 5. Illustration of the main phases of our fusion method. (a) Input image; (b) Ground truth; (c) S^O: Orientation feature map; (d) S^O_cut: Salient basic regions of the orientation feature map, which meet (16); (e) S^C: Color feature map; (f) S^C_cut: Salient basic regions of the color feature map, which meet (16); (g) αS^O + βS^C: Fusion map of orientation and color feature maps with adaptive weights; (h) FF: Fusion map of the orientation feature map, color feature map, and depth-from-focus feature.

D. Feature Map Fusion Based on Depth-From-Focus

According to FIT, the binding in the second stage can effectively fuse bottom-up feature maps, highlight targets, and suppress distractors. We design a more general and effective fusion method in this section. The main steps of the proposed fusion method are shown in Fig. 5. The proposed fusion method can adjust the weights of different feature maps to reflect the confidence level of each feature map; the fusion methods in [20] and [28] are special cases of the proposed method. The proposed fusion method linearly combines the orientation feature map and the color feature map with adaptive weights. The weights of the orientation and color feature maps are adjusted adaptively according to the DoS (degree-of-scattering) and eccentricity of each feature map. Examples of combining the orientation and color feature maps are shown in Fig. 4(h) and Fig. 5(g). As objects in focus attract more attention than those out of focus, the proposed fusion method adopts depth-from-focus to filter out the redundant visual information of the background and highlight salient objects in focus, as shown in Fig. 4(i) and Fig. 5(h). The final saliency value S_i of basic region i is assigned as

S_i = (\alpha S_i^O + \beta S_i^C) \cdot F_i,   (11)

where S_i^O and S_i^C are the orientation saliency and color saliency of basic region i, respectively; F_i is the depth-from-focus of basic region i; \alpha and \beta are parameters that determine the weights of the orientation and color feature maps, respectively.

Generally, the salient regions are small and dense. Thus, we can determine the weighting parameters for the orientation and color feature maps based on the DoS of the feature maps. The feature map with a higher DoS (see Fig. 5(c) and (d)) is assigned a lower weight, and vice versa. The DoS is defined as the variance of the spatial distances between the centroid of a feature map and all salient basic regions in the feature map. Accordingly, the computation of the DoS consists of three steps.

Firstly, we calculate the centroid of the orientation feature map, H^O, the centroid of the color feature map, H^C, and the centroid of the depth-from-focus map, H^F, with the method of [49]. We select the centroid of the orientation feature map or of the color feature map as the final centroid according to their centroid eccentricities. The centroid eccentricities of the orientation and color feature maps are defined as:

E^k = \frac{\| H^k - \bar{H} \|}{\| H^O - \bar{H} \| + \| H^C - \bar{H} \|},   (12)

where k \in \{O, C\} and \bar{H} is the mean position of the centroid of the orientation feature map H^O, the centroid of the color feature map H^C, and the centroid of the depth-from-focus map H^F. The centroid with a lower eccentricity is more reliable, so the final centroid is the centroid of the feature map with the lower eccentricity. Accordingly, we determine the final centroid H according to the following formula:

H = \begin{cases} H^O, & E^O \le E^C \\ H^C, & E^O > E^C. \end{cases}   (13)

Secondly, we compute the average distance between the final centroid and all salient basic regions according to:

\bar{d}^k = \frac{\sum_{m=1}^{M_s^k} \| p_m - H \|}{M_s^k},   (14)

where k \in \{O, C\}; M_s^k is the number of salient basic regions in feature map k; p_m is the position of salient basic region m, m \in [1, M_s^k]. The salient and non-salient basic regions are separated by an adaptive threshold T_s^k, as shown in Fig. 5(d) and (f). The threshold T_s^k is defined as the mean saliency of feature map k:

T_s^k = \frac{\sum_{i=1}^{N} S_i^k}{N},   (15)

where S_i^k is the saliency value of basic region i in feature map k, k \in \{O, C\}. Accordingly, p_m in (14) is the position of a salient basic region that meets the following requirement:

S_i^k > T_s^k, \quad k \in \{O, C\}.   (16)

Finally, we calculate the variance of the spatial distances between the final centroid H and all salient basic regions in each feature map. The variance V^k is defined as:

V^k = \frac{\sum_{m=1}^{M_s^k} (\| p_m - H \| - \bar{d}^k)^2}{M_s^k}, \quad k \in \{O, C\}.   (17)

A higher variance implies a higher DoS, and vice versa. Therefore, we define the DoS of each feature map as:

Sct^k = \frac{V^k}{V^O + V^C}, \quad k \in \{O, C\}.   (18)

In the proposed fusion method, the feature map with a lower DoS is assigned a higher weight in the linear combination of feature maps. Therefore, \alpha and \beta in (11) are set as:

\alpha = 1 - Sct^O, \quad \beta = 1 - Sct^C.   (19)
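The adaptive weighting of (12)–(19) can be sketched as follows. The centroids are computed here as saliency-weighted means of the region positions, which is a stand-in for the image-moment method of [49] and therefore an assumption.

```python
import numpy as np

def fusion_weights(S_o, S_c, F, positions):
    """Adaptive weights alpha, beta of Eqs. (12)-(19).

    S_o, S_c, F: per-region orientation saliency, color saliency and depth-from-focus;
    positions: (N, 2) normalized region positions. Centroids are saliency-weighted
    means of the region positions (a stand-in for the image-moment method of [49]).
    """
    def centroid(vals):
        return (vals[:, None] * positions).sum(axis=0) / (vals.sum() + 1e-12)

    H_o, H_c, H_f = centroid(S_o), centroid(S_c), centroid(F)
    H_bar = (H_o + H_c + H_f) / 3.0                              # mean of the three centroids
    d_o, d_c = np.linalg.norm(H_o - H_bar), np.linalg.norm(H_c - H_bar)
    E_o = d_o / (d_o + d_c + 1e-12)                              # eccentricity, Eq. (12)
    E_c = d_c / (d_o + d_c + 1e-12)
    H = H_o if E_o <= E_c else H_c                               # final centroid, Eq. (13)

    def dos(S):
        salient = S > S.mean()                                   # adaptive threshold, Eqs. (15)-(16)
        d = np.linalg.norm(positions[salient] - H, axis=1)
        d_mean = d.mean()                                        # average distance, Eq. (14)
        return np.mean((d - d_mean) ** 2)                        # variance, Eq. (17)

    V_o, V_c = dos(S_o), dos(S_c)
    sct_o = V_o / (V_o + V_c + 1e-12)                            # DoS, Eq. (18)
    sct_c = V_c / (V_o + V_c + 1e-12)
    return 1.0 - sct_o, 1.0 - sct_c                              # alpha, beta, Eq. (19)
```

The final region-level saliency then follows (11) as `(alpha * S_o + beta * S_c) * F`, before the pixel-level up-sampling of (20).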

As shown in Fig. 5, the DoS of the orientation feature map (Fig. 5(c)) is higher than that of the color feature map (Fig. 5(e)). Therefore, according to (19), α = 0.29 and β = 0.71 for the orientation and color feature maps, respectively.

Fig. 6. Visual comparison of state-of-the-art detection models with our proposed model (FF) and ground truth (GT).

Lastly, we assign a saliency value to each pixel to generate the full-resolution saliency map by using the up-sampling method in [27] and [50]. The saliency \dot{S}_j of a pixel is a weighted linear combination of the saliencies S_i of the basic regions:

\dot{S}_j = \sum_{i=1}^{N} w_{ji} S_i,   (20)

where the Gaussian weight w_{ji} = \frac{1}{Z_j} \exp(-\frac{1}{2} (\frac{1}{30} \| c_j - c_i \|^2 + \frac{1}{30} \| p_j - p_i \|^2)). Z_j is the normalization factor ensuring \sum_{i=1}^{N} w_{ji} = 1. Finally, we rescale the full-resolution saliency map to the range [0, 1].
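A direct, dense implementation of the up-sampling of (20) is sketched below; it evaluates the weight of every region for every pixel, which is simple but slow and memory-hungry for large images, so a practical implementation would restrict each pixel to nearby regions.

```python
import numpy as np

def upsample_saliency(lab_image, region_colors, region_positions, region_saliency):
    """Pixel-level saliency of Eq. (20): each pixel's saliency is a Gaussian-weighted
    combination of the region saliencies, with the 1/30 color and position variances
    from the text. Dense per-pixel evaluation is used here for clarity only."""
    h, w, _ = lab_image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix_pos = np.stack([xs / w, ys / h], axis=-1).reshape(-1, 2)      # normalized pixel positions
    pix_col = lab_image.reshape(-1, 3)

    col_d2 = ((pix_col[:, None, :] - region_colors[None, :, :]) ** 2).sum(-1)
    pos_d2 = ((pix_pos[:, None, :] - region_positions[None, :, :]) ** 2).sum(-1)
    w_ji = np.exp(-0.5 * (col_d2 / 30.0 + pos_d2 / 30.0))             # Gaussian weight w_ji
    w_ji /= w_ji.sum(axis=1, keepdims=True)                           # normalization Z_j

    sal = w_ji @ region_saliency                                      # Eq. (20)
    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)         # rescale to [0, 1]
    return sal.reshape(h, w)
```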

III. EXPERIMENTS

In this section, we evaluate the performance of the proposed model on three data sets. Firstly, we conduct an experiment to demonstrate the advantage of the proposed model on a publicly available database (termed DB-1000) provided by Achanta et al. [25]. This database includes 1000 original images and their corresponding ground truth in the form of accurate human-labeled salient objects. We compare the proposed model (termed FF) with 17 models on this database: IT [20], SR [51], HC [26], RC [26], GB [23], CA [24], FT [25], AC [25], SF [27], BT [33], LR [34], MZ [21], LC [52], VU [30], IS [53], AIM [54], and AWS [55]. Fig. 6 shows some comparison samples from these models. Saliency maps of some existing models (e.g., SR, MZ, LC, IT, GB, AC, CA, FT, HC, and RC) are obtained from [26]. Saliency maps of SF, BT, LR, VU, IS, AIM, and AWS are from [27], [33], [34], [30], [53], [54], and [55], respectively. We also provide experimental results compared with several state-of-the-art methods on the Microsoft database [22] in Section III-C, which is also a popular database for evaluating the performance of salient region detection models. This image database, termed DB-5000, includes 5000 original images and their corresponding ground truth indicated with bounding boxes by 9 subjects. Following these experiments on DB-1000 and DB-5000, we provide experimental results compared with several state-of-the-art methods on the DB-Bruce database [54] in Section III-D, which is an eye tracking database of natural images.

Fig. 7. Evaluation of the orientation, color, and final saliency maps. (a) Comparison among the orientation feature map, color feature map, and final saliency map; (b) comparison among the orientation feature map, color feature map, and depth-from-focus map.

A. Evaluation of Orientation, Color, and Final Saliency Map

In this section, we evaluate the performance of the orientation feature map, the color feature map, and the final saliency map. Similar to [22], [25]–[27], we evaluate the performance of our model using the precision and recall rates. The experiment in this section is conducted on DB-1000. In this experiment, the saliency map is segmented according to the saliency values with a fixed threshold. Given a threshold T_f \in [0, 1], the regions whose saliency values are higher than T_f are marked as salient regions. We generate 256 binary segmentations by thresholding the saliency maps with 256 threshold values T_f varying from 0 to 1. The precision rate is calculated as:

Precision = \frac{tp}{tp + fp},   (21)

where tp (true positive) is the number of pixels inside the salient regions of both the saliency map and the ground truth map, and fp (false positive) is the number of pixels inside the salient region of the saliency map and the non-salient region of the ground truth map. The recall rate is defined as:

Recall = \frac{tp}{tp + fn},   (22)

where fn (false negative) is the number of pixels inside the salient region of the ground truth map and the non-salient region of the saliency map.

The results of the precision and recall rates in Fig. 7 clearly demonstrate that the performance of the fusion map from orientation, color, and depth-from-focus is the best among the compared methods. In Fig. 7, SO is the precision-recall curve of the orientation feature map, and SC is the precision-recall curve of the color feature map. From these two curves, we can infer that color plays a more important role than orientation in salient region detection on DB-1000. When we combine the color feature map and the depth-from-focus map by a region-wise Hadamard product of the two maps, we get the SC · F curve; we get the (SO + SC)/2 curve when we combine the orientation and color feature maps by averaging them. From the SC · F curve and the (SO + SC)/2 curve in Fig. 7, we know that both orientation and depth-from-focus are important attributes for salient region detection and valuable supplements to the color feature. In Fig. 7, FF is the result of the proposed fusion scheme. Compared with (SO + SC)/2 and SC · F, it is clear that the proposed fusion scheme obtains more accurate results with higher precision and better recall rates.

Fig. 8. Precision and recall rates of the compared models. (a) and (b): comparison results on DB-1000; (c): comparison results on DB-5000.

Fig. 9. Precision, recall, and F-measure results from adaptive thresholds. (a): comparison results on DB-1000; (b): comparison results on DB-5000.
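The fixed-threshold precision and recall evaluation of (21) and (22) can be sketched as follows; the division guards for empty segmentations are implementation assumptions.

```python
import numpy as np

def precision_recall_curve(saliency_map, ground_truth, n_thresholds=256):
    """Precision (21) and recall (22) of a saliency map against a binary ground truth,
    swept over fixed thresholds T_f in [0, 1]."""
    gt = ground_truth.astype(bool)
    precisions, recalls = [], []
    for t in np.linspace(0.0, 1.0, n_thresholds):
        seg = saliency_map > t                       # binary segmentation at threshold T_f
        tp = np.logical_and(seg, gt).sum()           # true positives
        fp = np.logical_and(seg, ~gt).sum()          # false positives
        fn = np.logical_and(~seg, gt).sum()          # false negatives
        precisions.append(tp / max(tp + fp, 1))      # Eq. (21)
        recalls.append(tp / max(tp + fn, 1))         # Eq. (22)
    return np.array(precisions), np.array(recalls)
```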

B. Comparison on DB-1000

In this subsection, we provide an exhaustive comparison of our model (FF) with 17 state-of-the-art methods, including SR, MZ, LC, IT, GB, AC, CA, FT, HC, RC, SF, BT, LR, IS, AIM, AWS, and VU, on DB-1000. Fig. 6 shows some comparison samples from different models. As shown in Fig. 6, FF consistently produces the saliency maps closest to the ground truth among the compared models.

In this experiment, the saliency map is segmented according to the saliency values with a fixed threshold. Given a threshold T_f \in [0, 1], the regions whose saliency values are higher than T_f are marked as salient regions. We again generate 256 binary segmentations by thresholding the saliency maps with 256 threshold values T_f varying from 0 to 1. Fig. 8(a) and (b) show the resulting precision vs. recall curves. As shown in Fig. 8(a) and (b), our model (FF) outperforms the other 17 salient region detection models, including the latest models SF, BT, and LR, at every threshold and for any recall rate.

Furthermore, similar to [25], [27], we use image-dependent adaptive thresholds to evaluate the performance of our model. The image-dependent adaptive threshold is defined by [25] as twice the mean saliency of the saliency map. In addition to the precision and recall rates, the F-measure is introduced to evaluate the performance of all models. The F-measure is defined as:

F_\gamma = \frac{(1 + \gamma^2) \cdot Precision \cdot Recall}{\gamma^2 \cdot Precision + Recall},   (23)

where we use \gamma^2 = 0.3, as in [25]–[27], to weight the precision rate more than the recall rate. The comparison results are shown in Fig. 9(a). As shown in Fig. 9(a), our model outperforms the others in terms of precision, recall, and F-measure.
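A sketch of the adaptive-threshold F-measure of (23) follows, with the threshold set to twice the mean saliency as defined above and γ² = 0.3.

```python
import numpy as np

def f_measure_adaptive(saliency_map, ground_truth, gamma2=0.3):
    """F-measure of Eq. (23) under the image-dependent adaptive threshold
    (twice the mean saliency of the map), with gamma^2 = 0.3 as in [25]-[27]."""
    t_a = 2.0 * saliency_map.mean()                  # adaptive threshold
    seg = saliency_map > t_a
    gt = ground_truth.astype(bool)
    tp = np.logical_and(seg, gt).sum()
    fp = np.logical_and(seg, ~gt).sum()
    fn = np.logical_and(~seg, gt).sum()
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return (1 + gamma2) * precision * recall / max(gamma2 * precision + recall, 1e-12)
```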


TABLE I. The AUC Comparison on the DB-Bruce Database.

TABLE II. The Shuffled-AUC Comparison Results on DB-Bruce.

Fig. 10. ROC curves of the compared models on DB-Bruce.

Among the 17 saliency detection models used in the comparison experiments, BT [33] and LR [34] use top-down cues. BT uses the top part of humans (head area), face components (eyes, nose, and mouth), and center bias as top-down information, while LR adopts a location prior, a semantic prior (e.g., faces), and a color prior as top-down cues. As shown in Figs. 8 and 9, the proposed method obtains better performance than these two in saliency prediction.

C. Comparison on DB-5000

DB-5000 is also popular for the performance evaluation of salient region detection models. The ground truth of DB-5000 is labeled by 9 subjects in the form of bounding boxes. In this experiment, we calculate the ground-truth map for the images by averaging the 9 subjects' labeled data (similar to [22]). In order to accurately evaluate the performance of the proposed model, we also provide experimental results compared with several state-of-the-art methods, including SR, LC, FT, HC, RC, SF, VU, IS, AIM, and AWS, on DB-5000. Fig. 8(c) shows the resulting precision vs. recall curves on DB-5000. As shown in Fig. 8(c), FF outperforms SR, LC, FT, HC, RC, SF, IS, AIM, and AWS at every threshold and for any recall rate. Secondly, we use the image-dependent adaptive threshold to evaluate the performance of our model, and the comparison results of precision, recall, and F-measure are shown in Fig. 9(b). The image-dependent adaptive threshold T_a is also defined as twice the mean saliency of the saliency map. As shown in Fig. 8(c) and Fig. 9(b), FF clearly outperforms SR, LC, FT, HC, RC, SF, and IS in terms of precision, recall, and F-measure on DB-5000.

D. Comparison on DB-Bruce

We adopt DB-Bruce [54], an eye tracking database including 120 images with eye fixation data, to evaluate the prediction performance of the proposed model. In order to accurately evaluate the performance of the proposed model, we also provide experimental results compared with the state-of-the-art methods IT, SR, IS, AIM, AWS, LC, FT, HC, RC, and SF on DB-Bruce. Here, we use the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) [56] to evaluate the performance of the saliency detection models. Fig. 10 provides the comparison results for the ROC curves, and the AUC results are given in Table I. As shown in Fig. 10, the ROC curves of the proposed model (FF) and its bottom-up components (SO and SC) outperform the SR, LC, FT, and HC models. As shown in Table I, the AUC of FF is the

highest among the compared models. We also conduct comparison experiments using the shuffled AUC [57]. For the evaluation of the algorithm, we use the same procedure as that of [58]. Specifically, the shuffling of saliency maps is repeated 100 times. When calculating the area under the ROC curve, we also use 100 random permutations. The AUC values are reported in Table II. From Table II, the proposed algorithm outperforms the other state-of-the-art methods in terms of shuffled AUC.

IV. CONCLUSION

According to FIT, a two-stage salient region detection model has been proposed in this paper by fusing bottom-up and top-down features extracted from a single image. Existing global contrast-based models only consider the contrast of the color feature and neglect the important role of other features such as orientation. We have formulated a comprehensive set of features, including the contrast of orientation, to overcome the shortcomings of the existing models. In existing models, the fusion methods for binding feature maps (e.g., the averaging based fusion method or the selective fusion method) are not sufficiently effective. Therefore, a more general and effective fusion method has also been proposed in this paper. It can adaptively adjust the weights of different feature maps to reflect the confidence level of each feature map, and it is designed based on the DoS and eccentricities of the feature maps and the guidance of depth-from-focus. The depth-from-focus of an image, which is a common and significant top-down feature for visual attention and is easier to obtain, is used to yield the final saliency map. With the aid of depth-from-focus detection, the proposed fusion can filter out out-of-focus visual information and highlight in-focus salient regions. The evaluation of the proposed model has been carried out on three data sets. As indicated by the experimental results, the proposed model outperforms many existing relevant salient region detection methods.

REFERENCES

[1] Y. Tian, J. Li, S. Yu, and T. Huang, "Learning complementary saliency priors for foreground object segmentation in complex scenes," Int. J. Comput. Vis., Jul. 2014, doi: 10.1007/s11263-014-0737-1.

[2] U. Rutishauser, D. Walther, C. Koch, and P. Perona, "Is bottom-up attention useful for object recognition?" in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 2, Jun. 2004, pp. II-37–II-44.

[3] Y. Fang, Z. Chen, W. Lin, and C.-W. Lin, "Saliency detection in the compressed domain for adaptive image retargeting," IEEE Trans. Image Process., vol. 21, no. 9, pp. 3888–3901, Sep. 2012.


[4] M. Ding and R.-F. Tong, “Content-aware copying and pasting in images,” Vis. Comput., vol. 26, nos. 6–8, pp. 721–729, Jun. 2010. [5] H. Wu, Y.-S. Wang, K.-C. Feng, T.-T. Wong, T.-Y. Lee, and P.-A. Heng, “Resizing by symmetry-summarization,” in Proc. ACM SIGGRAPH Asia, Dec. 2010, pp. 159-1–159-10. [6] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu, “Sketch2photo: Internet image montage,” ACM Trans. Graph., vol. 28, no. 5, pp. 124-1–124-10, Dec. 2009. [7] X. Hou, J. Harel, and C. Koch, “Image signature: Highlighting sparse salient regions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 194–201, Jan. 2012. [8] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.-M. Hu, “Salient object detection and segmentation,” Dept. Comput. Sci. Technol., Tsinghua Univ., Beijing, China, Tech. Rep. 1, 2012. [9] A. Borji and L. Itti, “State-of-the-art in visual attention modeling,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 185–207, Jan. 2013. [10] A. M. Treisman and G. Gelade, “A feature-integration theory of attention,” Cognit. Psychol., vol. 12, no. 1, pp. 97–136, Jan. 1980. [11] A. Treisman, “Preattentive processing in vision,” Comput. Vis., Graph., Image Process., vol. 31, no. 2, pp. 156–177, Aug. 1985. [12] A. Treisman and S. Gormican, “Feature analysis in early vision: Evidence from search asymmetries,” Psychol. Rev., vol. 95, no. 1, pp. 15–48, Jan. 1988. [13] J. M. Wolfe, “Guided search 4.0: Current progress with a model of visual search,” in Integrated Models of Cognitive Systems, W. Gray, Ed. New York, NY, USA: Oxford, 2007, pp. 99–119. [14] J. M. Wolfe and T. S. Horowitz, “What attributes guide the deployment of visual attention and how do they do it?” Nature Rev. Neurosci., vol. 5, pp. 495–501, Jun. 2004. [15] A. Thiele and G. Stoner, “Neuronal synchrony does not correlate with motion coherence in cortical area MT,” Nature, vol. 421, no. 6921, pp. 366–370, Jun. 2003. [16] P. Fries, S. Neuenschwander, A. K. Engel, R. Goebel, and W. Singer, “Rapid feature selective neuronal synchronization through correlated latency shifting,” Nature Neurosci., vol. 4, no. 2, pp. 495–501, Jun. 2001. [17] A. Treisman, “The binding problem,” Current Opinion Neurobiol., vol. 6, no. 2, pp. 171–178, Aug. 1996. [18] J. W. Bisley and M. E. Goldberg, “Attention, intention, and priority in the parietal lobe,” Annu. Rev. Neurosci., vol. 33, no. 1, pp. 1–21, Jul. 2010. [19] D. Kahneman, A. Treisman, and B. J. Gibbs, “The reviewing of object files: Object-specific integration of information,” Cognit. Psychol., vol. 24, no. 2, pp. 175–219, Apr. 1992. [20] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, Nov. 1998. [21] Y.-F. Ma and H.-J. Zhang, “Contrast-based image attention analysis by using fuzzy growing,” in Proc. 11th Int. Conf. ACM Multimedia, Nov. 2003, pp. 374–381. [22] T. Liu, J. Sun, N.-N. Zheng, X. Tang, and H.-Y. Shum, “Learning to detect a salient object,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2007, pp. 1–8. [23] J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,” in Neural Information Processing System. Cambridge, MA, USA: MIT Press, 2006, pp. 545–552. [24] S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliency detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 2376–2383. [25] R. Achanta, S. Hemami, F. Estrada, and S. 
Susstrunk, “Frequency-tuned salient region detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 1597–1604. [26] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, “Global contrast based salient region detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 409–416. [27] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 733–740. [28] V. Gopalakrishnan, Y. Hu, and D. Rajan, “Salient region detection by modeling distributions of color and orientation,” IEEE Trans. Multimedia, vol. 11, no. 5, pp. 892–905, Aug. 2009. [29] Y. Hu, X. Xie, W.-Y. Ma, L.-T. Chia, and D. Rajan, “Salient region detection using weighted feature maps based on the human visual attention model,” in Advances in Multimedia Information Processing-PCM. New York, NY, USA: Springer-Verlag, 2005, pp. 993–1000.


[30] C. T. Vu and D. M. Chandler, “Main subject detection via adaptive feature refinement,” J. Electron. Imag., vol. 20, no. 1, p. 013011, Mar. 2011. [31] A. Torralba, A. Oliva, M. S. Castelhano, and J. M. Henderson, “Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search,” Psychol. Rev., vol. 113, no. 4, p. 766, 2006. [32] Y. Fang, W. Lin, C. T. Lau, and B.-S. Lee, “A visual attention model combining top-down and bottom-up mechanisms for salient object detection,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2011, pp. 1293–1296. [33] A. Borji, “Boosting bottom-up and top-down visual features for saliency estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 438–445. [34] X. Shen and Y. Wu, “A unified approach to salient object detection via low rank matrix recovery,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 853–860. [35] J. H. Reynolds and D. J. Heeger, “The normalization model of attention,” Neuron, vol. 61, no. 2, pp. 168–185, 2009. [36] F. Baluch and L. Itti, “Mechanisms of top-down attention,” Trends Neurosci., vol. 34, no. 4, pp. 210–224, 2011. [37] L. Elazary and L. Itti, “Interesting objects are visually salient,” J. Vis., vol. 8, no. 3, pp. 3.1–3.15, Mar. 2008. [38] T. Judd, K. Ehinger, F. Durand, and A. Torralba, “Learning to predict where humans look,” in Proc. IEEE 12th Int. Conf. Comput. Vis. (ICCV), Sep./Oct. 2009, pp. 2106–2113. [39] S. Frintrop, E. Rome, and H. I. Christensen, “Computational visual attention systems and their cognitive foundations: A survey,” ACM Trans. Appl. Perception, vol. 7, no. 1, p. 6, 2010. [40] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, “Slic superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012. [41] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971–987, Jul. 2002. [42] M. Greenacre. (2008). Measures of Distance Between Samples: Euclidean. [Online]. Available: http://www.econ.upf.edu/∼michael/ stanford/maeb4.pdf [43] A. Pentland, T. Darrell, M. Turk, and W. Huang, “A simple, real-time range camera,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 1989, pp. 256–261. [44] J. H. Elder and S. W. Zucker, “Local scale control for edge detection and blur estimation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 7, pp. 699–716, Jul. 1998. [45] S. Zhuo and T. Sim, “Defocus map estimation from a single image,” Pattern Recognit., vol. 44, no. 9, pp. 1852–1858, Sep. 2011. [46] S. Wu, W. Lin, S. Xie, Z. Lu, E. P. Ong, and S. Yao, “Blind blur assessment for vision-based applications,” J. Vis. Commun. Image Represent., vol. 20, no. 4, pp. 231–241, May 2009. [47] R. Liu, Z. Li, and J. Jia, “Image partial blur detection and classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2008, pp. 1–8. [48] Y.-W. Tai and M. S. Brown, “Single image defocus map estimation using local contrast prior,” in Proc. 16th IEEE Int. Conf. Image Process. (ICIP), Nov. 2009, pp. 1797–1800. [49] M.-K. Hu, “Visual pattern recognition by moment invariants,” IRE Trans. Inf. Theory, vol. 8, no. 2, pp. 179–187, Feb. 1962. [50] J. Dolson, J. Baek, C. Plagemann, and S. 
Thrun, “Upsampling range data in dynamic environments,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 1141–1148. [51] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2007, pp. 1–8. [52] Y. Zhai and M. Shah, “Visual attention detection in video sequences using spatiotemporal cues,” in Proc. 14th Annu. ACM Int. Conf. Multimedia, 2006, pp. 815–824. [53] X. Hou, J. Harel, and C. Koch, “Image signature: Highlighting sparse salient regions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 194–201, Jan. 2012. [54] N. D. B. Bruce and J. K. Tsotsos, “Saliency based on information maximization,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2005, pp. 155–162. [55] A. Garcia-Diaz, V. Leborán, X. R. Fdez-Vidal, and X. M. Pardo, “On the relationship between optical variability, visual saliency, and eye fixations: A computational approach,” J. Vis., vol. 12, no. 6, pp. 1–22, 2012.


[56] N. Murray, M. Vanrell, X. Otazu, and C. A. Parraga, “Saliency estimation using a non-parametric low-level vision model,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 433–440. [57] B. W. Tatler, R. J. Baddeley, and I. D. Gilchrist, “Visual correlates of fixation selection: Effects of scale and time,” Vis. Res., vol. 45, no. 5, pp. 643–659, 2005. [58] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell, “Sun: A Bayesian framework for saliency using natural statistics,” J. Vis., vol. 8, no. 7, pp. 1–3, 2008.

Huawei Tian received the B.E. degree from the College of Computer Science, Chongqing University, Chongqing, China, in 2006, and the Ph.D. degree from the Institute of Information Science, Beijing Jiaotong University, Beijing, China, in 2013. Since 2013, he has been with the faculty of the People’s Public Security University of China, Beijing. His research interests include image processing, pattern recognition, digital watermarking and steganography, digital forensics, visual attention, and 3D videos.

Yuming Fang is currently a Faculty Member with the School of Information Technology, Jiangxi University of Finance and Economics, Nanchang, China. He received the Ph.D. degree in computer engineering from Nanyang Technological University, Singapore, and the B.E. and M.S. degrees from Sichuan University, Chengdu, China, and the Beijing University of Technology, Beijing, China, respectively. From 2011 to 2012, he was a Visiting Ph.D. student with National Tsing Hua University, Hsinchu, Taiwan. In 2012, he was a Visiting Scholar with the University of Waterloo, Waterloo, ON, Canada. He was also a (Visiting) Post-Doctoral Research Fellow with the IRCCyN Laboratory, PolyTech’ Nantes - University of Nantes, Nantes, France, University of Waterloo, and Nanyang Technological University. His research interests include visual attention modeling, visual quality assessment, image retargeting, computer vision, and 3D image/video processing. He was a Secretary of the 2013 Joint Conference on Harmonious Human Machine Environment. He was also a Workshop Organizer in the 2014 International Conference on Multimedia and Expo, and a Special Session Organizer in the 2013 Visual Communications and Image Processing conference and the 2014 International Workshop on Quality of Multimedia Experience.

Yao Zhao (M’06–SM’12) received the B.S. degree from the Department of Radio Engineering, Fuzhou University, Fuzhou, China, in 1989, the M.E. degree from the Department of Radio Engineering, Southeast University, Nanjing, China, in 1992, and the Ph.D. degree from the Institute of Information Science, Beijing Jiaotong University (BJTU), Beijing, China, in 1996. He became an Associate Professor and a Professor at BJTU, in 1998 and 2001, respectively. From 2001 to 2002, he was a Senior Research Fellow with the Information and Communication Theory Group, Faculty of Information Technology and Systems, Delft University of Technology, Delft, The Netherlands. He is currently the Director of the Institute of Information Science with BJTU. His current research interests include image/video coding, digital watermarking and forensics, and video analysis and understanding. He serves on the Editorial Boards of several international journals, including as an Associate Editor of the IEEE T RANSACTIONS ON C YBERNETICS , the IEEE S IGNAL P ROCESSING L ETTERS , and an Area Editor of Signal Processing: Image Communication (Elsevier). He was named a Distinguished Young Scholar by the National Science Foundation of China in 2010, and was elected as a Chang Jiang Scholar of the Ministry of Education of China in 2013. He is a fellow of the Institution of Engineering and Technology.

Weisi Lin (M’92–SM’98) received the Ph.D. degree from King’s College London, London, U.K. He is currently an Associate Professor of Computer Engineering, Nanyang Technological University, Singapore. He was the Lab Head and an Acting Department Manager of Media Processing with the Institute for Infocomm Research, Singapore. His research areas include image processing, perceptual multimedia modeling and evaluation, and visual signal compression and communication. He has authored over 100 refereed journal papers and over 170 conference papers, and holds seven patents. He was on the Editorial Boards of the IEEE T RANSACTIONS ON M ULTIMEDIA (2011–2013), the IEEE S IGNAL P ROCESSING L ETTERS , and the Journal of Visual Communication and Image Representation. He currently chairs the IEEE MMTC IG on Quality-of-Experience. He was elected as an APSIPA Distinguished Lecturer (2012–2013). He was the Technical-Program Chair of the 2013 IEEE International Conference on Multimedia and Expo and the 2012 Pacific-Rim Conference on Multimedia, and the TechnicalProgram Chair of the 2014 International Workshop on Quality of Multimedia Experience. He is a fellow of the Institution of Engineering Technology, and an Honorary Fellow of the Singapore Institute of Engineering Technologists.

Rongrong Ni received the Ph.D. degree in signal and information processing from the Institute of Information Science, Beijing Jiaotong University (BJTU), Beijing, China, in 2005. Since Spring 2005, she has been with the faculty of the School of Computer and Information Technology and the Institute of Information Science, BJTU, where she has been a Professor since 2013. Her research interests include image processing, data hiding and digital forensics, and pattern recognition.

Zhenfeng Zhu received the Ph.D. degree in pattern recognition and intelligence system from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2005. He is currently an Associate Professor with the Institute of Information Science, Beijing Jiaotong University, Beijing. He was a Visiting Scholar with the Department of Computer Science and Engineering, Arizona State University, Phoenix, AZ, USA, in 2010. His research interests include image and video understanding, computer vision, and machine learning.
