
Adaptive Metric Learning for Saliency Detection

Shuang Li, Huchuan Lu, Senior Member, IEEE, Zhe Lin, Member, IEEE, Xiaohui Shen, Member, IEEE, and Brian Price

Abstract— In this paper, we propose a novel adaptive metric learning algorithm (AML) for visual saliency detection. A key observation is that the saliency of a superpixel can be estimated by its distance from the most certain foreground and background seeds. Instead of measuring distance in the Euclidean space, we present a learning method based on two complementary Mahalanobis distance metrics: 1) generic metric learning (GML) and 2) specific metric learning (SML). GML targets the global distribution of the whole training set, while SML considers the specific structure of a single image. Since multiple similarity measures from different views may enhance the relevant information and alleviate the irrelevant information, we fuse GML and SML together and experimentally find that the combined result works well. Different from most existing methods, which are directly based on low-level features, we devise a superpixel-wise Fisher vector coding approach to better distinguish salient objects from the background. We also propose an accurate seeds selection mechanism and exploit contextual and multi-scale information when constructing the final saliency map. Experimental results on various image sets show that the proposed AML performs favorably against the state of the art.

Index Terms— Metric learning, saliency detection, Mahalanobis distance, Fisher vector.

I. INTRODUCTION

VISUAL saliency aims at finding the regions of an image that are more visually distinctive or important, and often serves as a pre-processing step for many vision tasks, such as image categorization [1], image retrieval [2], image compression [3], and content-aware image/video resizing [4]. Visual saliency essentially reduces to the problem of separating the salient regions from the non-salient ones by measuring differences in their features. Numerous models and algorithms have been proposed for this purpose. Unsupervised approaches [5]–[9] are stimuli-driven and rely largely on distinguishing low-level visual features. Early unsupervised models, such as Gaussian pyramids [5], center-surround contrast [5], and fuzzy growing [10], are mainly inspired by early studies of biological vision.

Manuscript received August 21, 2014; revised February 10, 2015 and April 10, 2015; accepted May 26, 2015. Date of publication June 3, 2015; date of current version June 23, 2015. This work was supported in part by the Natural Science Foundation of China under Grant 61472060 and in part by the Fundamental Research Funds for the Central Universities under Grant DUT14YQ101. The associate editor coordinating the review of this manuscript and approving it for publication was Mr. Pierre-Marc Jodoin. S. Li and H. Lu are with the School of Information and Communication Engineering, Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, China (e-mail: [email protected]; [email protected]). Z. Lin, X. Shen, and B. Price are with Adobe Research, San Jose, CA 95110 USA (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2015.2440755

Fig. 1. The comparison between the Euclidean distance space and the Mahalanobis distance space. The Mahalanobis distance is more discriminative than the Euclidean distance, since its background part is less salient.

Later studies address saliency detection from broader views, e.g., convex hull [7], [11] and the frequency domain [12], [13]. In contrast, supervised methods [14]–[16] incorporate high-level and known information to better distinguish the salient regions by learning salient visual information from a large number of images with ground-truth labels. Despite the differences among these methods, they all require the basic ability to compute a difference measure on region features in order to distinguish them. To the best of our knowledge, all existing models address saliency detection based on the Euclidean distance. However, the Euclidean distance weights features equally without considering the distribution of the data, and thus becomes unreliable when detecting objects in complex images. This situation arises frequently in saliency detection, especially when the salient regions and the background are similar, in which case the Euclidean distances between foreground regions and similar background regions can be smaller than the distances within the foreground itself. Figure 1 illustrates this problem. Given an image, we first select some initial seeds, including foreground and background seeds; the seed selection process is the one described in Section III-C. We compute the distance between each superpixel and the seeds and draw the distance distributions in Figure 1. We observe that the Mahalanobis distance is more discriminative than the Euclidean distance, since its background part is less salient. This motivates us to train a discriminative distance metric that assigns appropriate weights to features so that the objects can be precisely separated from the background. We use metric learning to compute a more discriminative distance measure.
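As a toy numeric illustration of this observation (the features, values and diagonal metric below are invented for the example; in our method the metric is learned from data):

```python
import numpy as np

# Invented toy features [R, G, texture] for a foreground seed, a background
# seed and a test superpixel whose *color* resembles the background but
# whose *texture* matches the foreground. In the paper each of these would
# be an SFV vector and the metric would be learned, not hand-picked.
fg_seed = np.array([0.90, 0.90, 0.80])
bg_seed = np.array([0.20, 0.30, 0.20])
test    = np.array([0.25, 0.35, 0.75])

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def mahalanobis(a, b, M):
    # sqrt((a-b)^T M (a-b)); with M = I this reduces to the Euclidean distance
    d = a - b
    return np.sqrt(d @ M @ d)

M = np.diag([0.1, 0.1, 4.0])   # down-weight color, up-weight texture

print(euclidean(test, fg_seed), euclidean(test, bg_seed))
# 0.853 > 0.555: Euclidean wrongly puts the test region nearer the background
print(mahalanobis(test, fg_seed, M), mahalanobis(test, bg_seed, M))
# 0.287 < 1.100: the weighted metric recovers the correct assignment
```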



Fig. 2. The comparison between low-level features and our SFV feature. (a) input image. (b) saliency map based on low-level features. (c) saliency map based on SFV. (d) ground truth.

Fig. 3. Pipeline of the adaptive metric learning algorithm. IT [5], GB [19], LR [14], RC [6] are four other saliency methods.

Distance metric learning has been widely adopted for this purpose in different applications, since it takes into account the covariance information when estimating data distributions and significantly improves the performance of learning methods. To our knowledge, we are the first to successfully formulate the saliency detection problem in a metric learning framework, and our method works well on different databases. We also propose a Superpixel-wise Fisher Vector (SFV) coding approach which maps low-level features, such as RGB and LAB, to a high-dimensional sparse vector. Compared with using low-level features directly, the SFV is more discriminative in challenging environments, as shown in Figure 2. Thus we use SFV features to describe each superpixel. In this paper, we adopt an effective feature coding method and propose a novel metric learning based saliency detection model, which incorporates both supervised and semi-supervised information. Our algorithm considers both the global distribution of the whole training dataset (GML) and the typical structure of a specific image (SML), and we fuse them together to extract the clustering characteristics for estimating the final saliency map. Figure 3 shows the pipeline of our method. First, as an extension of traditional Fisher Vector coding [17], Superpixel-wise Fisher Vector coding is proposed to describe superpixels by learning the parameters of a Gaussian mixture model (Section III-A). Second, we train a Generic metric from the training set (Section III-B1) and apply it to a single image to find the saliency seeds with the assistance of the superpixel-wise

objectness map generated by [18] (Section III-C). Third, a Specific metric based on kernel classification is learnt from the chosen seeds for each image (Section III-B2). Finally, by integrating the Generic metric and the Specific metric (Section III-D), we obtain the clustering information for each superpixel and use it to generate the final saliency map (Section III-E). The GML and SML maps shown in Figure 3 are intermediate images that are not actually generated when computing saliency maps; they serve as comparisons to demonstrate the effectiveness of the fused results in Section IV-A. The main contributions of our work include:
• Two metric learning approaches are applied to saliency detection for the first time as the optimal distance measure between two superpixels. GML is learnt from the global training set while SML is learnt from the training samples of the specific image. They are complementary to each other and achieve promising results after affinity aggregation.
• A superpixel-wise Fisher vector coding method is put forward for the first time; it incorporates image contextual information when representing superpixels and makes supervised learning methods more suitable for single-image processing.
• An accurate seeds selection method is presented for the first time based on the Mahalanobis distance metric. The selected seeds serve as training samples for the Specific metric learning and as reference nodes when evaluating saliency values.


Experimental results on various image sets show that our method is comparable with most state-of-the-art approaches, and the proposed metric learning approaches can be extended to other fields as well.

II. RELATED WORK

Saliency detection has seen significant improvement and progress in recent years. Numerous unsupervised approaches have been proposed under different theoretical models. Cheng et al. [6] propose a global region contrast algorithm which simultaneously considers the spatial coherence across the regions and the global contrast over the entire image. However, low-level color contrast becomes unreliable when dealing with challenging scenes. Li et al. [20] compute dense and sparse reconstruction errors based on background templates extracted from image boundaries. They propose several integration strategies, such as multi-scale reconstruction error and Bayesian integration, which significantly improve the performance of saliency detection. In [21], boundary connectivity, a robust background measure, is first applied to saliency detection; it characterizes the spatial layout of image regions and provides a specific geometric explanation of its definition. Perazzi et al. [22] formulate complete contrast and saliency estimation in a unified way using high-dimensional Gaussian filters. They modify SLIC [23] and demonstrate the effectiveness of their superpixel segmentation approach in detecting salient objects. Furthermore, lacking knowledge of the sizes and locations of objects, boundary priors and objectness are often adopted to highlight the salient regions or suppress the background. Jiang et al. [18] construct saliency by integrating three visual cues: uniqueness, focusness and objectness (UFO), where uniqueness represents color contrast, focusness indicates the degree of focus (often appearing as the reverse of blurriness), and objectness, proposed by Alexe et al. [24], is the likelihood of a given image window containing an object. In [25], Wei et al. define the saliency value of each patch as the shortest distance to the image boundary, observing that image boundaries are more likely to be background. However, this assumption is less convincing, especially when the scene is challenging.

Compared with unsupervised approaches, supervised methods are relatively rare. In [26] and [27], Jiang et al. propose a multi-scale learning approach, which maps the regional feature vector to a saliency score and fuses these scores across multiple levels to generate the final saliency map. They introduce a novel feature vector, integrating regional contrast, regional property and regional backgroundness descriptors, to represent each region, and learn a discriminative random forest regressor to predict regional scores. Shen and Wu [14] treat an image as the combination of sparse noise and a low-rank matrix. They extract low-level features to form high-level priors and then incorporate the priors into a low-rank matrix recovery model for constructing the saliency map. However, the saliency assignment near the object is unsatisfying due to the ambiguity of the prior maps. Liu et al. [28] formulate saliency detection


as a partial differential equation (PDE) problem and solve it under an adaptive PDE learning framework. They learn the optimal saliency seeds via discrete submodularity and use the seeds as boundary conditions to solve a linear elliptic system. Inspired by these works, we construct a metric fusion framework which contains two complementary metric learning approaches to generate robust and accurate saliency maps even in complex scenes. Our method encodes low-level features into a high-dimensional feature space and incorporates multi-scale and objectness information when measuring saliency values. Therefore, our method can uniformly highlight objects with explicit object boundaries.

III. PROPOSED ALGORITHM

In this section, we present an effective and robust adaptive metric learning method for visual saliency detection. The proposed algorithm proceeds through five steps to generate the final saliency map. Firstly, we extract low-level features to encode the superpixels generated by the simple linear iterative clustering (SLIC) [23] algorithm with a Superpixel-wise Fisher Vector representation. Secondly, two Mahalanobis distance metric learning approaches, Generic metric learning and Specific metric learning, are introduced to learn the optimal distance measure between superpixels. Thirdly, we propose a novel seeds selection strategy based on the Mahalanobis distance to generate saliency seeds, which are used both as training samples for the Specific metric and as reference nodes when evaluating saliency values. Fourthly, a metric fusion framework is presented to fuse the Generic and Specific metrics together. Finally, we obtain smooth saliency maps by combining spectral clustering and multi-scale information.

A. Superpixel-Wise Fisher Vector Coding (SFV)

Appropriate feature coding approaches can effectively extract the main information and remove redundancies, thus greatly improving the performance of saliency detection. The Fisher Vector can be regarded as an extension of the well-known bag-of-words representation, since it captures the first-order and second-order differences between local features and the centers of a mixture of Gaussian distributions. Recently, Chen et al. [29] extended the Fisher Vector to point-level image representation for object detection. For a different purpose, we propose to further extend FV coding to the superpixel level and experimentally verify the superiority of our Superpixel-wise Fisher Vector coding method.

Given a superpixel $i = \{p_t,\; t = 1, \dots, T\}$, where $p_t$ is a $\mathcal{D}$-dimensional image pixel and $T$ is the number of pixels within $i$, we train a Gaussian mixture model (GMM) $u_\lambda(p_t) = \sum_{k=1}^{K} \upsilon_k \psi_k(p_t)$ from all the pixels of an image using the maximum likelihood (ML) criterion. The parameters of the $K$-component GMM are defined as $\lambda = \{\upsilon_k, \mu_k, \Sigma_k,\; k = 1, \dots, K\}$, where $\upsilon_k$, $\mu_k$ and $\Sigma_k$ are the mixture weight, mean vector and covariance matrix of Gaussian $k$ respectively. Similar to the FV coding method, the SFV representation can be written in a $\Phi = 2K\mathcal{D}$-dimensional concatenated form:

$$\varphi_i = \{\zeta_{\mu_1}, \zeta_{\sigma_1}, \dots, \zeta_{\mu_K}, \zeta_{\sigma_K}\} \tag{1}$$


where $\zeta_{\mu_k}$ and $\zeta_{\sigma_k}$ are defined as

$$\zeta_{\mu_k} = \frac{1}{T\sqrt{\upsilon_k}}\sum_{t=1}^{T}\eta_t(k)\,\frac{p_t-\mu_k}{\sigma_k}, \qquad \zeta_{\sigma_k} = \frac{1}{T\sqrt{2\upsilon_k}}\sum_{t=1}^{T}\eta_t(k)\Big\{\frac{(p_t-\mu_k)^2}{\sigma_k^2}-1\Big\},$$

$\sigma_k$ is the square root of the diagonal values of $\Sigma_k$, and $\eta_t(k)$ is the soft assignment of $p_t$ to Gaussian $k$. The SFV representation $\varphi_i$ is hereby used to describe superpixel $i$ in this paper. It has several advantages:
• As an extension of Fisher Vector coding, SFV realizes superpixel-level coding, making the Fisher Vector more suitable for single-image processing. Instead of averaging the low-level features of the contained pixels, SFV statistically analyzes the internal feature distribution of each superpixel, providing a more accurate and reliable representation. Experiments show that our SFV generates smoother and more uniform saliency maps and improves the precision-recall curve by about 2 percent compared with low-level features on the MSRA-1000 database, as shown in Figure 7.
• SFV can be regarded as an adaptive Fisher Vector coding, since the parameters of the GMM are trained online on a specific image. This means that even identical superpixels in different images have different coding representations. Therefore, our SFV better captures image contextual information.
• Due to the small number of superpixels in an image and their disjoint nature, SFV is much faster than existing state-of-the-art FV variants. Furthermore, besides saliency detection, SFV can also be applied to other vision tasks, such as image segmentation and content-aware image resizing.
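A minimal sketch of this coding step, following our reading of Eqn 1 (the function name, the use of scikit-learn's GaussianMixture, and the toy data are ours, not part of the paper):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def sfv_encode(pixels_per_superpixel, K=1):
    """Superpixel-wise Fisher Vector sketch. pixels_per_superpixel: list of
    (T_i, D) arrays of low-level pixel features (e.g. concatenated
    RGB+LAB, D=6), one array per superpixel of the same image."""
    all_pixels = np.vstack(pixels_per_superpixel)
    # GMM trained online on this image's pixels (ML criterion)
    gmm = GaussianMixture(n_components=K, covariance_type='diag').fit(all_pixels)
    w, mu = gmm.weights_, gmm.means_             # (K,), (K, D)
    sigma = np.sqrt(gmm.covariances_)            # (K, D) diagonal std devs

    codes = []
    for pix in pixels_per_superpixel:
        T = len(pix)
        eta = gmm.predict_proba(pix)             # soft assignments, (T, K)
        parts = []
        for k in range(K):
            diff = (pix - mu[k]) / sigma[k]      # (T, D)
            zeta_mu = (eta[:, k:k+1] * diff).sum(0) / (T * np.sqrt(w[k]))
            zeta_sig = (eta[:, k:k+1] * (diff**2 - 1)).sum(0) / (T * np.sqrt(2 * w[k]))
            parts += [zeta_mu, zeta_sig]
        codes.append(np.concatenate(parts))      # 2*K*D-dimensional SFV
    return np.array(codes)

# e.g. three fake superpixels with random "RGB+LAB" features:
rng = np.random.default_rng(0)
sp = [rng.random((n, 6)) for n in (40, 55, 30)]
print(sfv_encode(sp).shape)  # (3, 12) for K=1, D=6
```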

B. Adaptive Metric Learning

Learning a discriminative metric can better distinguish samples in different classes, as well as shorten the distances within the same class. Numerous models and methods have been proposed in the last decade, especially for Mahalanobis distance metric learning, such as information-theoretic metric learning (ITML) [30], large margin nearest neighbor (LMNN) [31], [32], and logistic discriminative based metric learning (LDML) [33]. However, most existing metric learning approaches learn a fixed metric for all samples without considering the deeper structure of the data, thereby breaking down in the presence of irrelevant or unreliable features. In this paper, we propose an adaptive metric learning approach, which considers both the global distribution of the whole training set (GML) and the specific structure of a single image (SML) to better separate objects from the background. Our approach can also be viewed as an integration of a supervised distance metric learning model (GML) and a semi-supervised one (SML). Since GML and SML are complementary to each other, we obtain promising results after fusing them under an affinity aggregation framework (Section III-D).

1) Generic Metric Learning (GML): Metric learning has been widely applied to vision tasks, but has never been used for saliency detection because of its long training time, which is infeasible for single-image processing. We solve this problem by pre-training a Generic metric $M_g$ on the first 500 images of the MSRA-1000 database using gradient descent, and we verify, both experimentally and empirically, that $M_g$ is generally suitable for all images. First, we construct a training set $\{\varphi_i,\; i = 1, 2, \dots, M\}$ consisting of superpixels extracted from all training images, where $\varphi_i$ is the SFV representation of superpixel $i$. To find the most discriminative $M_g$, we minimize

$$M_g^* = \arg\min_{M_g}\; \alpha\,\|M_g\|^2 + \frac{1}{2}\sum_{\{ij\,|\,\delta_i^n=1,\;\delta_j^n=0\}} D(i,j) \tag{2}$$

$$D(i,j) = \exp\{-(\varphi_i-\varphi_j)^T M_g\,(\varphi_i-\varphi_j)/\sigma_1^2\} \tag{3}$$

where $\delta_i^n$ indicates whether the $i$-th superpixel in the $n$-th image belongs to the foreground or the background, and $D(i,j)$ is the exponential Mahalanobis distance between $i$ and $j$ under the metric $M_g$. We set $\sigma_1 = 0.1$ to control the strength of the distances. Considering that the background is varied and chaotic, and that different object regions are distinctive as well, we only impose restrictions on pairwise distances between positive and negative samples, which is more reliable and reasonable given that salient objects are always distinct from the background. This minimization maximizes the feature distances between foreground and background samples, thereby significantly improving the performance of saliency detection. Eqn 2 can be easily solved by gradient descent. The Generic metric includes information from all superpixels in all training images, and is thus appropriate for most images.
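A gradient-descent sketch of Eqn 2 (the learning rate, iteration count and PSD projection are our choices; the paper does not specify them):

```python
import numpy as np

def train_generic_metric(fg, bg, alpha=0.1, sigma1=0.1, lr=0.01, iters=200):
    """Gradient-descent sketch of Eqn 2. fg: (Nf, D) foreground SFVs,
    bg: (Nb, D) background SFVs. alpha, lr and iters are illustrative
    guesses; the paper does not report them."""
    D = fg.shape[1]
    Mg = np.eye(D)                                  # start from the Euclidean metric
    diffs = (fg[:, None, :] - bg[None, :, :]).reshape(-1, D)  # all FG-BG pairs
    for _ in range(iters):
        quad = np.einsum('nd,de,ne->n', diffs, Mg, diffs)
        d = np.exp(-quad / sigma1 ** 2)             # D(i, j) of Eqn 3
        # gradient: 2*alpha*Mg - (1 / (2*sigma1^2)) * sum_n d_n * diff_n diff_n^T
        grad = 2 * alpha * Mg - (diffs * d[:, None]).T @ diffs / (2 * sigma1 ** 2)
        Mg -= lr * grad
        # project back onto symmetric positive semi-definite matrices
        Mg = (Mg + Mg.T) / 2
        vals, vecs = np.linalg.eigh(Mg)
        Mg = vecs @ np.diag(np.clip(vals, 0, None)) @ vecs.T
    return Mg
```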

2) Specific Metric Learning (SML): Recently, Wang et al. [34] proposed a doublet-SVM metric learning approach based on a kernel classification framework, which formulates metric learning as an SVM problem and achieves desirable results with less training time. However, experiments show that directly applying doublet-SVM to saliency detection cannot ensure good detection accuracy. Therefore, we modify this approach by adding a constraint $\omega_{(\tau_1,\tau_2)}$, which significantly improves the performance of the final saliency map. Let $\{\varphi_i,\; i = 1, 2, \dots, m\}$ be the training dataset, where $\varphi_i$ is the SFV representation of a labeled superpixel extracted from a specific image. The detailed process of extracting labeled superpixels from an image is discussed in Section III-C. We first divide these samples into foreground seeds and background seeds and label them as 1 and 0 respectively. Given a training sample $\varphi_i$ with label $h_i$, we find its $q_1$ nearest neighbors with the same label and $q_2$ nearest neighbors with different labels, and then $(q_1+q_2)$ doublets are constructed for it. Each doublet consists of the training sample $\varphi_i$ and one of its nearest neighbors. By combining the doublets of all samples, a doublet set $\chi = \{x_1, x_2, \dots, x_Z\}$ is established, where $x_\tau = (\varphi_{\tau,1}, \varphi_{\tau,2})$, $\tau = 1, 2, \dots, Z$ is one of the doublets, and $\varphi_{\tau,1}$ and $\varphi_{\tau,2}$ are the SFVs of superpixels $\tau_1$ and $\tau_2$ in doublet $x_\tau$. We assign $x_\tau$ a label as follows: $l_\tau = -1$ if $h_{\tau,1} = h_{\tau,2}$, and $l_\tau = 1$ if $h_{\tau,1} \neq h_{\tau,2}$.
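The doublet construction can be sketched as follows (the q1 and q2 values are placeholders; the paper does not report them):

```python
import numpy as np

def build_doublets(X, labels, q1=2, q2=2):
    """Doublet construction of Section III-B2 (a sketch; q1/q2 are
    placeholder values). X: (N, D) seed SFVs; labels: (N,) array of
    0/1 seed labels. Pairs each seed with its q1 nearest same-label and
    q2 nearest different-label neighbors."""
    doublets, y = [], []
    for i in range(len(X)):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                      # exclude the sample itself
        same = np.where(labels == labels[i])[0]
        diff = np.where(labels != labels[i])[0]
        for j in same[np.argsort(dist[same])][:q1]:
            doublets.append((i, j)); y.append(-1)   # same label -> l = -1
        for j in diff[np.argsort(dist[diff])][:q2]:
            doublets.append((i, j)); y.append(+1)   # different label -> l = +1
    return doublets, np.array(y)
```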


As an extension of the degree-2 polynomial kernel, we define the doublet-level degree-2 polynomial kernel as:

$$K_p(x_\tau, x_\iota) = \mathrm{tr}\big(\omega_{(\tau_1,\tau_2)}(\varphi_{\tau,1}-\varphi_{\tau,2})(\varphi_{\tau,1}-\varphi_{\tau,2})^T\,\omega_{(\iota_1,\iota_2)}(\varphi_{\iota,1}-\varphi_{\iota,2})(\varphi_{\iota,1}-\varphi_{\iota,2})^T\big) = \omega_{(\tau_1,\tau_2)}\,\omega_{(\iota_1,\iota_2)}\,\{(\varphi_{\tau,1}-\varphi_{\tau,2})^T(\varphi_{\iota,1}-\varphi_{\iota,2})\}^2 \tag{4}$$

where $\omega_{(\tau_1,\tau_2)} = \theta_{(\tau_1,\tau_2)} * O_{(\tau_1,\tau_2)}$ is a weight parameter, with

$$\theta_{(\tau_1,\tau_2)} = 1-\exp(-dist(\tau_1,\tau_2)/\sigma_2) \tag{5}$$

$$O_{(\tau_1,\tau_2)} = 1-\exp\{-(O_{\tau_1}-O_{\tau_2})^2/\sigma_2\} \tag{6}$$

where $dist(\tau_1,\tau_2)$ is the spatial distance between superpixels $\tau_1$ and $\tau_2$, and $\theta_{(\tau_1,\tau_2)}$ is the corresponding exponential spatial distance. $O_{\tau_1}$ is the objectness score of superpixel $\tau_1$ defined in Eqn 11, and $O_{(\tau_1,\tau_2)}$ is the superpixel-wise objectness distance between $\tau_1$ and $\tau_2$. We set $\sigma_2 = 0.1$. The weight parameter $\omega_{(\tau_1,\tau_2)}$ provides crucial spatial and prior information about the objects of interest, and is thus more robust for evaluating the similarity between a pair of superpixels than the feature distance alone. In order to determine the similarity of the two samples in a doublet, we further define a kernel decision function as follows:

$$E(x) = \mathrm{sgn}\Big\{\sum_\tau \alpha_\tau l_\tau K_p(x_\tau, x) + \beta\Big\} \tag{7}$$

where $\alpha_\tau$ is the weight of doublet $x_\tau$ and $\beta$ is a bias parameter. We have

$$\sum_\tau \alpha_\tau l_\tau K_p(x_\tau, x) + \beta = \omega_{(x_1,x_2)}(\varphi_{x,1}-\varphi_{x,2})^T M_s\,(\varphi_{x,1}-\varphi_{x,2}) + \beta \tag{8}$$

$$M_s = \sum_\tau \alpha_\tau l_\tau\,\omega_{(\tau_1,\tau_2)}(\varphi_{\tau,1}-\varphi_{\tau,2})(\varphi_{\tau,1}-\varphi_{\tau,2})^T \tag{9}$$

For ease of computation, we set $\omega_{(x_1,x_2)} = 1$. The proposed Specific metric $M_s$ can be easily solved by existing SVM solvers. The Specific metric is trained only on the test image, and it is much faster to obtain than with existing metric learning approaches. According to [34], the doublet-SVM is, on average, 2000 times faster than ITML [30]. Therefore, it is feasible to train a Specific metric for each image to better distinguish its objects from the background.

In this section, we propose two metric learning approaches: GML and SML. The first considers the global distribution of the whole training set, while the second aims at exploring the deeper structure of a specific image. GML can be pretrained offline and is generally suitable for all images, while SML is much faster, since it can be solved by existing SVM solvers. We note that the image-specific metric is not always better than the Generic metric, as it has fewer training samples and less reliable labels. Instead, these two metrics are complementary to each other and can be fused to improve the performance of the final detection results.
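Putting Eqns 4-9 together, a sketch of training $M_s$ with an off-the-shelf SVM over the precomputed doublet kernel (the SVM's $C$ value and the helper interfaces are our assumptions; `build_doublets` refers to the sketch above):

```python
import numpy as np
from sklearn.svm import SVC

def specific_metric(X, centers, obj, doublets, y, sigma2=0.1, C=1.0):
    """Sketch of Eqns 4-9. X: (N, D) seed SFVs; centers: (N, 2) superpixel
    positions; obj: (N,) objectness scores; doublets, y: output of
    build_doublets above. C is a guess; the paper uses an SVM solver."""
    d = np.array([X[a] - X[b] for a, b in doublets])             # (Z, D)
    theta = np.array([1 - np.exp(-np.linalg.norm(centers[a] - centers[b]) / sigma2)
                      for a, b in doublets])                     # Eqn 5
    Od = np.array([1 - np.exp(-(obj[a] - obj[b]) ** 2 / sigma2)
                   for a, b in doublets])                        # Eqn 6
    w = theta * Od                                               # omega of Eqn 4
    G = np.outer(w, w) * (d @ d.T) ** 2                          # doublet kernel, Eqn 4
    svm = SVC(kernel='precomputed', C=C).fit(G, y)
    # Eqn 9: Ms = sum_tau (alpha_tau * l_tau) * omega_tau * d_tau d_tau^T
    Ms = np.zeros((X.shape[1], X.shape[1]))
    for coef, idx in zip(svm.dual_coef_[0], svm.support_):
        Ms += coef * w[idx] * np.outer(d[idx], d[idx])
    return Ms
```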

C. Iterative Seeds Selection by Mahalanobis Distance (ISMD)

As a preliminary criterion of saliency detection, saliency seeds directly influence the performance of seed-based solutions. Recently, Liu et al. [28] proposed an optimal seeds selection strategy via submodularity; by adding a stopping criterion, the submodularity problem can be solved and the optimal seed set obtained accordingly. In [35], Lu et al. learn optimal seeds by combining bottom-up saliency maps and mid-level vision cues. Inspired by their work, we propose a compact but efficient iterative seeds selection scheme based on Mahalanobis distance assessment (ISMD). Alexe et al. [24] present an objectness method to measure the likelihood of a given image window containing an object. Jiang et al. [18] extend the original objectness to pixel-level objectness $O(p)$ and region-level objectness $O_i$ by defining:

$$O(p) = \sum_{w=1}^{W} P(w) \tag{10}$$

$$O_i = \frac{1}{T}\sum_{p\in i} O(p) \tag{11}$$

where $W$ is the number of sampling windows that contain pixel $p$, $P(w)$ is the probability score of the $w$-th window, and $T$ is the number of pixels within region $i$. We redefine the region-level objectness as superpixel-wise objectness in this paper. Motivated by the fact that highlights of the superpixel-wise objectness map are more likely to be foreground seeds, a set of initial foreground seeds is constructed from the lightest two percent of regions of the objectness map. Considering that the background is massive and scattered, we pick several of the lowest objectness values from each boundary of the superpixel-wise objectness map as initial background seeds. The intuition is that if superpixel $i$ is a foreground seed, the ratio of its distances from foreground seeds and background seeds should be small. We formulate the ratio as follows:

$$\Gamma_i = \frac{\sum_{fs} d_{rat}(i, fs)}{\sum_{bs} d_{rat}(i, bs)} \tag{12}$$

where

$$d_{rat}(i, fs) = \phi(i, fs)\,(\varphi_i-\varphi_{fs})\, M_g\, (\varphi_i-\varphi_{fs})^T \tag{13}$$

is the Mahalanobis distance between superpixel $i$ and one of the foreground seeds $fs$ under the Generic metric $M_g$, and $\phi(i, fs) = d(i, fs) * O_{(i, fs)}$ is a weight parameter, where

$$d(i, fs) = \exp(-dist(i, fs)^2/\sigma_2) \tag{14}$$

is another kind of exponential spatial distance between superpixel $i$ and $fs$. Only when $\Gamma_i \le \Gamma_0$ or $\Gamma_i \ge \Gamma_1$ can $i$ be added to the foreground or background seed set, where $\Gamma_0$ and $\Gamma_1$ are two thresholds. With the newly added seeds each time, we iterate this process $N_1$ times. Since most of the area in an image belongs to the background, in order to generate more background seeds the iteration continues $N_2$ more times, but only selects seeds with $\sum_{bs} d_{rat}(i, bs) \le \Gamma_2$, where $\Gamma_2$ is a threshold. We then obtain the final seeds set as illustrated in Figure 4.
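The selection loop can be sketched as below (the threshold values, iteration counts and the `phi` weight-matrix interface are invented stand-ins for the paper's unreported $\Gamma_0$, $\Gamma_1$, $N_1$ and $\phi(i, fs)$):

```python
import numpy as np

def ismd(X, Mg, fg0, bg0, phi, g0=0.2, g1=5.0, n1=3):
    """Iterative seeds selection sketch. X: (N, D) SFVs; Mg: Generic metric;
    fg0/bg0: initial seed index lists; phi: (N, N) weights of Eqn 13.
    g0, g1 and n1 stand in for the paper's unreported Gamma_0, Gamma_1, N_1."""
    def d_rat(i, s):
        diff = X[i] - X[s]
        return phi[i, s] * diff @ Mg @ diff          # Eqn 13

    fg, bg = set(fg0), set(bg0)
    for _ in range(n1):
        for i in range(len(X)):
            if i in fg or i in bg:
                continue
            ratio = sum(d_rat(i, s) for s in fg) / sum(d_rat(i, s) for s in bg)
            if ratio <= g0:                          # Eqn 12: near the foreground
                fg.add(i)
            elif ratio >= g1:                        # far from the foreground
                bg.add(i)
    # a second phase of N_2 iterations that only admits background seeds
    # (the sum-threshold Gamma_2 rule) would follow the same pattern
    return sorted(fg), sorted(bg)
```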


Fig. 4. Iterative seeds selection by Mahalanobis distance. Initial saliency seeds are first selected from the lightest and the darkest parts of the superpixel-wise objectness map. By computing the Mahalanobis distance between any superpixel and the chosen seeds, we iteratively increase the foreground and background seeds.


As elaborated in Section III-B2, the Specific metric $M_s$ can be learnt from the labeled seeds via doublet-SVM. One may be concerned that $M_s$ will rely too heavily on $M_g$, since the labeled seeds are generated under $M_g$. Fortunately, by learning a generally suitable metric, we can enforce a very high seed accuracy (98.82% on the MSRA-1000 database), which means the seed-based Specific metric is reliable enough to measure the distance.

D. Metric Fusion for Extracting Spectral Clustering Characteristics

Aggregating several affinity matrices appropriately may enhance the relevant and useful information and, at the same time, alleviate the irrelevant and unreliable information. Spectral clustering is an important unsupervised clustering algorithm for transferring the feature representation into a more discriminative indicator space, and we refer to this property as "spectral clustering characteristics". Spectral clustering has been applied to many fields for its effective and outstanding performance. In this section, we merge the metric fusion into a spectral clustering feature extraction process [36] and learn the optimal aggregation weight for each affinity matrix. The fusion strategy significantly improves the results of saliency detection, as shown in Figure 5. Based on the two metrics learnt above, two affinity matrices $\Pi^g$ and $\Pi^s$ are constructed with the corresponding $ij$-th elements

Fig. 5. Evaluation of metrics. (a) input images. (b) Generic metric. (c) Specific metric. (d) fused results. (e) ground truth.

$$\pi^g_{i,j} = \exp\{-\phi(i,j)\,(\varphi_i-\varphi_j)\, M_g\, (\varphi_i-\varphi_j)^T/\sigma_3\}, \qquad \pi^s_{i,j} = \exp\{-\phi(i,j)\,(\varphi_i-\varphi_j)\, M_s\, (\varphi_i-\varphi_j)^T/\sigma_3\} \tag{15}$$

where $\sigma_3 = 0.1$. The affinity aggregation strategy aims at finding the optimal clustering characteristic vector $\ell$ of all the superpixels in an image and the weight parameter $\vartheta = [\vartheta_g, \vartheta_s]^T$ associated with $\Pi^g$ and $\Pi^s$, so the fusion problem can be conducted as:

$$\min_{\substack{\vartheta_g,\vartheta_s\\ \ell_1,\dots,\ell_r}} \sum_{i,j}\big\{\vartheta_g^2\,\pi^g_{i,j}\,\|\ell_i-\ell_j\|^2 + \vartheta_s^2\,\pi^s_{i,j}\,\|\ell_i-\ell_j\|^2\big\} = \min_{\substack{\vartheta_g,\vartheta_s\\ \ell_1,\dots,\ell_r}}\big\{\vartheta_g^2\,\ell^T(H_g-\Pi^g)\ell + \vartheta_s^2\,\ell^T(H_s-\Pi^s)\ell\big\} = \min_{\vartheta_g,\vartheta_s}\big(\beta_g\vartheta_g^2 + \beta_s\vartheta_s^2\big) \tag{16}$$

where $\ell_i$ is the clustering characteristic indicator of superpixel $i$, $r$ is the number of superpixels in an image, $H_g = \mathrm{diag}\{h_{11},\dots,h_{rr}\}$ is the diagonal matrix of $\Pi^g$ with diagonal elements $h_{ii} = \sum_j \pi^g_{i,j}$, and $\beta_g = \ell^T(H_g-\Pi^g)\ell$. To solve this problem, we first employ two constraints: the normalized weight constraint $\vartheta_g + \vartheta_s = 1$ and the normalized spectral clustering constraint $\ell^T H \ell = 1$. By fixing $\vartheta$, the clustering characteristic vector $\ell$ can be easily obtained using standard spectral clustering. If $\ell$ is given, Eqn 16 can be formulated as:

$$\min_{\vartheta_g,\vartheta_s}(\beta_g\vartheta_g^2+\beta_s\vartheta_s^2) = \min_{\mu_g,\mu_s}(\rho_g\mu_g^2+\rho_s\mu_s^2) \tag{17}$$

subject to

$$\mu_g^2+\mu_s^2 = 1, \qquad \frac{\mu_g}{\sqrt{\alpha_g}}+\frac{\mu_s}{\sqrt{\alpha_s}} = 1 \tag{18}$$

where $\alpha_g = \ell^T H_g \ell$, $\rho_g = \beta_g/\alpha_g$ and $\mu_g = \sqrt{\alpha_g}\,\vartheta_g$. This can be easily solved by existing 1D line-search methods. To summarize, metric fusion tries to find the optimal clustering characteristic vector $\ell$ and the optimal weight parameter $\vartheta$ via a two-step iterative strategy. Since the affinity matrices incorporate $\phi(i,j)$ in Eqn 15, convergence is very fast, about three iterations per image. We use the indicator representation to compute saliency maps (Section III-E).
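A simplified sketch of the two-step iteration (we use the unnormalized graph Laplacian, and solve the weight subproblem under the plain constraint $\vartheta_g+\vartheta_s=1$, whose minimizer is $\vartheta_k \propto 1/\beta_k$; the paper's exact constraint in Eqn 18 would instead be handled by a 1D line search):

```python
import numpy as np

def fuse_metrics(Pg, Ps, n_vec=4, iters=3):
    """Two-step affinity aggregation sketch. Pg, Ps: symmetric affinity
    matrices of Eqn 15. n_vec (embedding size) and the equal-weight start
    are our choices."""
    th = np.array([0.5, 0.5])                        # [theta_g, theta_s]
    for _ in range(iters):
        # step 1: spectral clustering characteristics under fixed weights
        W = th[0] ** 2 * Pg + th[1] ** 2 * Ps
        L = np.diag(W.sum(1)) - W                    # graph Laplacian
        vals, vecs = np.linalg.eigh(L)
        emb = vecs[:, 1:1 + n_vec]                   # indicator space (skip trivial vector)
        # step 2: beta_k = sum_ij pi^k_ij * ||l_i - l_j||^2 for each affinity
        d2 = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(-1)
        betas = np.array([(Pg * d2).sum(), (Ps * d2).sum()])
        th = (1 / betas) / (1 / betas).sum()         # minimizer under theta_g + theta_s = 1
    return emb, th
```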

E. Context-Based Multi-Scale Saliency Detection

In this section, we propose a context-based multi-scale saliency detection algorithm to compute the saliency map for each image. Lacking knowledge of the sizes of objects, we first generate superpixels at $S$ different scales. Then the K-means algorithm is applied at each scale to segment the image into $N$ clusters via their SFV features.


Fig. 6. The distribution of saliency values of ground truth foregrounds and backgrounds. (a) Generic metric on MSRA-1000. (b) Specific metric on MSRA-1000. (c) AML on MSRA-1000. (d) AML on MSRA-5000.

According to the intuition that a superpixel is salient if its cluster neighbors are close to the foreground seeds and far from the background seeds, we define the distances between superpixel $i$ and the saliency seeds at scale $s$ as:

$$D^{(s)}_{i,fs} = \sum_{q=1}^{fn}\Big\{\gamma\,\|\ell_i-\ell_q\| + (1-\gamma)\sum_{j=1}^{N_c} W_{i,j}\,\|\ell_j-\ell_q\|\Big\}$$

$$D^{(s)}_{i,bs} = \sum_{q=1}^{bn}\Big\{\gamma\,\|\ell_i-\ell_q\| + (1-\gamma)\sum_{j=1}^{N_c} W_{i,j}\,\|\ell_j-\ell_q\|\Big\} \tag{19}$$

where

$$W_{i,j} = Q_1\exp\{-dist(i,j)^2/\sigma_2\} * Q_2\exp\{-(O_i-O_j)^2/\sigma_2\} \tag{20}$$

Fig. 7. (a) Precision-recall curves for the Generic metric, the Specific metric, and the fused results without neighbor smoothness (MSRA-1000 and Berkeley-300); precision-recall curves based on SFV and low-level features; precision-recall curves for the other two fusion methods. (b) Images of fused results based on SFV and low-level features.

is the weighted distance between superpixel $i$ and its cluster neighbor $j$, $\ell_i$ is the clustering characteristic indicator of superpixel $i$, and $fn$ and $bn$ are the numbers of foreground and background seeds chosen by our ISMD seeds selection approach. $Q_1$, $Q_2$ and $\gamma$ are weight parameters, and $N_c$ is the number of cluster neighbors of superpixel $i$. The saliency value of superpixel $i$ can be formulated as:

$$sal(i) = \sum_{s=1}^{S} \frac{\nu_s * \exp(O_i)}{1 + \{1-\exp(-D^{(s)}_{i,fs}/\sigma_4)\}/D^{(s)}_{i,bs}} = \sum_{s=1}^{S} \frac{\nu_s * \exp(O_i) * D^{(s)}_{i,bs}}{D^{(s)}_{i,bs} + 1 - \exp(-D^{(s)}_{i,fs}/\sigma_4)} \tag{21}$$

where $\nu_s$ is the weight of scale $s$, and $\sigma_4 = 0.1$. The consideration of all the other superpixels belonging to the same cluster and of multiple scales smooths the saliency map effectively, and makes our approach more robust in dealing with complicated scenes.
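Eqn 21 reduces to a few lines once the per-scale distances are available (a sketch; the array shapes and the final normalization are our conventions):

```python
import numpy as np

def saliency(D_fs, D_bs, O, nu, sigma4=0.1):
    """Eqn 21 sketch. D_fs, D_bs: (S, N) per-scale distances of Eqn 19;
    O: (N,) objectness scores; nu: (S,) scale weights."""
    sal = np.zeros(D_fs.shape[1])
    for s in range(D_fs.shape[0]):
        num = nu[s] * np.exp(O) * D_bs[s]
        den = D_bs[s] + 1 - np.exp(-D_fs[s] / sigma4)
        sal += num / den
    # normalize to [0, 1] for display as a saliency map
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)
```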

IV. EXPERIMENTS

We evaluate the proposed method on four benchmark datasets. The first is MSRA-1000 [13], a subset of MSRA-5000, which has been widely used in previous works for its accurate human-labelled masks. The second is the MSRA-5000 dataset [15], which includes 5000 more comprehensive images. The third is THUS-10000 [37], which consists of 10000 images, each containing an unambiguous salient object with pixel-wise ground truth labeling. The last is Berkeley-300 [38], which contains more challenging scenes with multiple objects of different sizes and locations. Since we have already used the first 500 images of MSRA-1000 for training, we evaluate our algorithm and compare it with other methods on the remaining 500 images of MSRA-1000, the 4500 images of MSRA-5000 that exclude the 500 training images (MSRA-5000 contains all the images of MSRA-1000), the 9501 images of THUS-10000 that exclude its 499 training images, and Berkeley-300.

A. Evaluation of Metrics

We perform several comparative experiments, shown in Figure 5, Figure 6 and Figure 7(a), to demonstrate the effectiveness of the Generic metric (GML), the Specific metric (SML), and their combination (AML based on SFV). In order to eliminate the influence of neighbor smoothness (Eqn 19) when comparing metrics, we compute only the distance between each superpixel and the seeds, instead of the sum of weighted distances of its cluster neighbors:

$$D^{(s)}_{i,fs} = \sum_{q=1}^{fn}\|\ell_i-\ell_q\|, \qquad D^{(s)}_{i,bs} = \sum_{q=1}^{bn}\|\ell_i-\ell_q\| \tag{22}$$

The precision-recall curves of the Generic metric and the Specific metric are almost the same, but their combination outperforms both of them. We also tried adding or multiplying the saliency maps generated by these two metrics directly, but the resulting PR curves are much lower than that of our fusion approach in Figure 7(a). This is consistent with our motivation: $M_g$ is trained from the


Fig. 8. Results of different methods. (a), (b) Precision-recall curves on MSRA-1000. (c) Average precisions, recalls, F-measures and AUC on MSRA-1000. (d), (e) Precision-recall curves on MSRA-5000. (f) Average precisions, recalls, F-measures and AUC on MSRA-5000.

Fig. 9. Results of different methods. (a), (b) Precision-recall curves on THUS-10000. (c) Average precisions, recalls, F-measures and AUC on THUS-10000. (d), (e) Precision-recall curves on Berkeley-300. (f) Average precisions, recalls, F-measures and AUC on Berkeley-300.

whole training dataset, capturing the global distribution of the data, while $M_s$ targets a single image, considering the specific structure of its samples. Figure 5 demonstrates that the fused results largely remove the faint saliency values produced in background regions by GML and SML. Since most parts of computing saliency maps under different metrics are the same, e.g., the objectness prior map, seeds selection, etc., it is reasonable that Figure 5 (b) and (c) are similar, but there are still differences between them. To further examine this, we conduct an extra experiment as shown in Figure 11. The second row shows the results generated by fusing the GML with itself, the third row shows the results generated by fusing the SML with itself, and the fourth row is obtained by fusing the GML and SML. We refer to them as GG, SS, and AML respectively. Limited by the image resolution, some differences between GML and SML may not be visible in Figure 5, but integrating each metric with itself apparently enlarges their distinctiveness. Furthermore, if one metric is incorrect, the other can compensate for it. SS performs better than GG in Figure 11 (a)-(e), while GG is better in (f)-(g), and AML tends to take the best of both, which demonstrates that GML and SML are indeed complementary to each other and improve the performance of saliency detection after fusion. Figure 11 (k)-(m) show that if both GML and SML produce bad results, the fused result is still bad. In addition, we plot the distribution of saliency values in Figure 6. Ground truth masks provide a specific label, 1 or 0, for each pixel, and we regard a superpixel as foreground when more than 80% of its pixels are labeled 1; otherwise, the superpixel is background. We put all the foreground superpixels from the whole dataset together and plot the distribution of their saliency values computed by different saliency methods as the red line. The blue line is the

distribution of saliency values of background superpixels. Figure 6(a), (b) and (c) are the saliency distributions produced by GML, SML and AML on MSRA-1000 respectively; Figure 6(d) is AML on MSRA-5000. This shows that AML is better than GML and SML, since its background saliency values are closer to 0. Furthermore, our Generic metric is robust across databases. We apply the metric trained on MSRA-1000 to all the databases, including MSRA-1000, MSRA-5000, THUS-10000, and Berkeley-300. As shown in Figure 8 and Figure 9, the results remain promising even on the other databases, which demonstrates the effectiveness and adaptiveness of our Generic metric. Overall, the fused results based on two strong and complementary metrics achieve higher precision and recall values and generate more accurate saliency maps.

B. Evaluation of Superpixel-Wise Fisher Vector

We have mentioned that our Superpixel-wise Fisher Vector coding approach can improve the performance of saliency detection by capturing the average first-order and second-order differences between local features and the centers of a mixture of Gaussian distributions. In our experiments, we extract the low-level features RGB and LAB to learn a 12-dimensional SFV representation for each superpixel ($\mathcal{D} = 6$, $K = 1$, $\Phi = 2K\mathcal{D} = 12$). Figure 7(a) shows the efficiency of our SFV coding approach by comparing the precision-recall curves of low-level features and the SFV on the MSRA-1000 database. Figure 7(b) shows the corresponding images.

C. Evaluation of Saliency Maps

We compare the proposed saliency detection model with several state-of-the-art methods: IT [5], GB [19], FT [13],


Fig. 10. The comparison of previous methods, our algorithm and ground truth. (a) Test image. (b) IT [5]. (c) GB [19]. (d) GC [39]. (e) CB [44]. (f) UFO [18]. (g) Proposed. (h) Ground truth.

GC [39], UFO [18], SVO [40], HS [41], PD [42], AMC [43], RCJ [37], DSR [20], DRFI [26], CB [44], RC [6], LR [14] and XL [45]. We use the source code provided by the authors or implement the methods based on available code or software. We conduct several quantitative comparisons of typical saliency detection methods. Figure 8(a), (b), (d) and (e) show that the proposed AML is comparable with most state-of-the-art methods on the MSRA-1000 and MSRA-5000 databases. Figure 8(c) and (f) compare average precision, recall, F-measure and AUC. We use AUC as an evaluation criterion, since it represents the area under the PR curve and effectively reflects the global properties of different algorithms. Instead of using bounding boxes to evaluate saliency detection performance on the MSRA-5000 database, we adopt the accurate human-labeled masks provided by [26] to ensure more reliable comparative results. We also perform experiments on the THUS-10000 and Berkeley-300 databases, as shown in Figure 9. Precision-recall curves show that AML reaches 97.4%, 94.0%, 96.5% and 81.5% precision on MSRA-1000, MSRA-5000, THUS-10000 and Berkeley-300 respectively. All of these results demonstrate the effectiveness of our method.

Figure 10 shows sample results of five previous approaches and our AML algorithm. The IT and GB methods are capable of finding the salient regions in most cases, but they tend to highlight object boundaries and miss much of the object because of the blurriness of their saliency maps. The GC method cannot capture all the salient pixels and often mislabels small background patches as salient regions. The CB and UFO models can highlight the objects uniformly, but they fail in challenging scenes. Our method can capture both small and large salient objects even in complex environments. In addition, it highlights the objects uniformly with accurate boundaries and does not need to care about the number or locations of the salient objects. We also measured the average computational cost on the different datasets: 18.15s on MSRA-1000, 18.42s on MSRA-5000, 17.90s on THUS-10000 and 18.78s on Berkeley-300. The proposed algorithm is implemented in MATLAB on a PC with an Intel i7-3370 CPU (3.4 GHz) and 32 GB of memory.

D. Evaluation of Selected Seeds

We train an effective Specific metric based on the assumption that the selected seeds are correct. In our experiments,


Fig. 11. Example results of different metrics. The first row shows the input images, the second row the results generated by fusing the GML with itself, the third row the results generated by fusing the SML with itself, the fourth row the results obtained by fusing the GML and SML, and the last row the ground truth images.

we cannot ensure that the chosen seeds are completely accurate, but we can enforce a very high seed accuracy. The accuracy of the selected seeds is defined as follows:

$$sa = \frac{fs_c + bs_c}{fs_t + bs_t} = \frac{fs_c + bs_c}{(fs_c + fs_{ic}) + (bs_c + bs_{ic})} \tag{23}$$

where

$$fs_c = \sum_n\sum_i (gt_i^n \,\&\, seed_i^n), \qquad bs_c = \sum_n\sum_i (\overline{gt_i^n} \,\&\, \overline{seed_i^n}) \tag{24}$$

Here $i$ denotes the $i$-th superpixel extracted from the $n$-th image of a given database, and $gt_i^n$ and $seed_i^n$ are its ground truth label and the label assigned by our seeds selection mechanism.
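A sketch of this accuracy computation (the boolean-mask representation is our choice; the complement for background seeds in Eqn 24 follows from the definition of a correct background seed):

```python
import numpy as np

def seed_accuracy(gt, seed):
    """Eqns 23-24 sketch. gt, seed: boolean arrays over all labeled seed
    superpixels of a database (True = foreground); seed holds the labels
    assigned by ISMD."""
    fs_c = np.sum(gt & seed)         # correctly labeled foreground seeds
    bs_c = np.sum(~gt & ~seed)       # correctly labeled background seeds
    return (fs_c + bs_c) / len(gt)   # (fs_c + bs_c) / (fs_t + bs_t)
```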

The accuracy rates on the four databases are: 0.9882 on MSRA-1000, 0.9769 on MSRA-5000, 0.9822 on THUS-10000 and 0.8874 on Berkeley-300. We experimentally verify that the seeds are accurate enough to generate a reliable Specific metric for each image.

V. CONCLUSION

In this paper, we propose two Mahalanobis distance metric learning models and a superpixel-wise Fisher vector representation for visual saliency detection. To our knowledge, we are the first to apply metric learning to saliency detection and to devise a metric fusion mechanism to improve the detection accuracy. Different from previous methods, we adopt a new feature coding strategy and make supervised metric learning more suitable for single-image processing. In addition, we propose an accurate seeds selection method based on the Mahalanobis distance measure to train the Specific metric and construct the final saliency map. We estimate the saliency value of each superpixel from a multi-scale view and include contextual information when computing it. Experimental results against sixteen state-of-the-art algorithms on four benchmark image databases demonstrate the effectiveness of our metric learning approach and the saliency detection model. In the future, we plan to explore more robust object detection approaches to further improve the accuracy of saliency detection.

REFERENCES

[1] C. Siagian and L. Itti, "Rapid biologically-inspired scene classification using features shared with visual attention," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2, pp. 300–312, Feb. 2007.
[2] H. Liu, X. Xie, X. Tang, Z.-W. Li, and W.-Y. Ma, "Effective browsing of Web image search results," in Proc. 6th ACM SIGMM Int. Workshop Multimedia Inf. Retr., 2004, pp. 84–90.
[3] C. Christopoulos, A. Skodras, and T. Ebrahimi, "The JPEG2000 still image coding system: An overview," IEEE Trans. Consum. Electron., vol. 46, no. 4, pp. 1103–1127, Nov. 2000.
[4] Y. Niu, F. Liu, X. Li, and M. Gleicher, "Warp propagation for video resizing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 537–544.
[5] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, Nov. 1998.
[6] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, "Global contrast based salient region detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 409–416.
[7] Y. Xie, H. Lu, and M.-H. Yang, "Bayesian saliency via low and mid level cues," IEEE Trans. Image Process., vol. 22, no. 5, pp. 1689–1698, May 2013.
[8] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 3166–3173.
[9] J. Sun, H. Lu, and X. Liu, "Saliency region detection based on Markov absorption probabilities," IEEE Trans. Image Process., vol. 24, no. 5, pp. 1639–1649, May 2015.
[10] Y.-F. Ma and H.-J. Zhang, "Contrast-based image attention analysis by using fuzzy growing," in Proc. 11th ACM Int. Conf. Multimedia, 2003, pp. 374–381.
[11] J. Sun, H. Lu, and S. Li, "Saliency detection based on integration of boundary and soft-segmentation," in Proc. IEEE Int. Conf. Image Process., Sep./Oct. 2012, pp. 1085–1088.
[12] X. Hou and L. Zhang, "Saliency detection: A spectral residual approach," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2007, pp. 1–8.
[13] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, "Frequency-tuned salient region detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 1597–1604.
[14] X. Shen and Y. Wu, "A unified approach to salient object detection via low rank matrix recovery," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 853–860.
[15] T. Liu et al., "Learning to detect a salient object," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 2, pp. 353–367, Feb. 2011.
[16] J. Yang and M.-H. Yang, "Top-down visual saliency via joint CRF and dictionary learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2296–2303.
[17] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, "Image classification with the Fisher vector: Theory and practice," Int. J. Comput. Vis., vol. 105, no. 3, pp. 222–245, 2013.


[18] P. Jiang, H. Ling, J. Yu, and J. Peng, "Salient region detection by UFO: Uniqueness, focusness and objectness," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1976–1983.
[19] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in Proc. Adv. Neural Inf. Process. Syst., 2006, pp. 545–552.
[20] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang, "Saliency detection via dense and sparse reconstruction," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 2976–2983.
[21] W. Zhu, S. Liang, Y. Wei, and J. Sun, "Saliency optimization from robust background detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2814–2821.
[22] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, "Saliency filters: Contrast based filtering for salient region detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 733–740.
[23] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012.
[24] B. Alexe, T. Deselaers, and V. Ferrari, "Measuring the objectness of image windows," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2189–2202, Nov. 2012.
[25] Y. Wei, F. Wen, W. Zhu, and J. Sun, "Geodesic saliency using background priors," in Proc. 12th Eur. Conf. Comput. Vis. (ECCV), 2012, pp. 29–42.
[26] H. Jiang, Z. Yuan, M.-M. Cheng, Y. Gong, N. Zheng, and J. Wang. (2014). "Salient object detection: A discriminative regional feature integration approach." [Online]. Available: http://arxiv.org/abs/1410.5926
[27] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, "Salient object detection: A discriminative regional feature integration approach," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2083–2090.
[28] R. Liu, J. Cao, Z. Lin, and S. Shan, "Adaptive partial differential equation learning for visual saliency detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 3866–3873.
[29] Q. Chen et al., "Efficient maximum appearance search for large-scale object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 3190–3197.
[30] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 209–216.
[31] K. Q. Weinberger, J. Blitzer, and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," in Proc. Adv. Neural Inf. Process. Syst., 2005, pp. 1473–1480.
[32] K. Q. Weinberger and L. K. Saul, "Fast solvers and efficient implementations for distance metric learning," in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 1160–1167.
[33] M. Guillaumin, J. Verbeek, and C. Schmid, "Is that you? Metric learning approaches for face identification," in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 498–505.
[34] F. Wang, W. Zuo, L. Zhang, D. Meng, and D. Zhang. (2013). "A kernel classification framework for metric learning." [Online]. Available: http://arxiv.org/abs/1309.5823
[35] S. Lu, V. Mahadevan, and N. Vasconcelos, "Learning optimal seeds for diffusion-based salient object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2790–2797.
[36] H.-C. Huang, Y.-Y. Chuang, and C.-S. Chen, "Affinity aggregation for spectral clustering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 773–780.
[37] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.-M. Hu, "Global contrast based salient region detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, Mar. 2014.
[38] V. Movahedi and J. H. Elder, "Design and perceptual validation of performance measures for salient object segmentation," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2010, pp. 49–56.
[39] M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet, and N. Crook, "Efficient salient region detection with soft image abstraction," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1529–1536.
[40] K.-Y. Chang, T.-L. Liu, H.-T. Chen, and S.-H. Lai, "Fusing generic objectness and visual saliency for salient object detection," in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 914–921.
[41] Q. Yan, L. Xu, J. Shi, and J. Jia, "Hierarchical saliency detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1155–1162.


[42] R. Margolin, A. Tal, and L. Zelnik-Manor, "What makes a patch distinct?" in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 1139–1146.
[43] B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang, "Saliency detection via absorbing Markov chain," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1665–1672.
[44] H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, and S. Li, "Automatic salient object segmentation based on context and shape prior," in Proc. BMVC, 2011, pp. 110.1–110.12.
[45] Y. Xie and H. Lu, "Visual saliency detection based on Bayesian model," in Proc. 18th IEEE Int. Conf. Image Process., Sep. 2011, pp. 645–648.

Shuang Li is currently pursuing the B.E. degree with the School of Information and Communication Engineering, Dalian University of Technology (DUT), China. From 2012 to 2015, she was a Research Assistant with the Computer Vision Group, DUT. Her research interests focus on saliency detection and object recognition.

Huchuan Lu (SM'12) received the M.Sc. degree in signal and information processing and the Ph.D. degree in system engineering from the Dalian University of Technology (DUT), Dalian, China, in 1998 and 2008, respectively. He joined as a Faculty Member in 1998, and is currently a Full Professor with the School of Information and Communication Engineering, DUT. His current research interests include the areas of computer vision and pattern recognition with a focus on visual tracking, saliency detection, and segmentation. He is also a member of the Association for Computing Machinery and an Associate Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART B.

Zhe Lin (M'10) received the B.Eng. degree in automatic control from the University of Science and Technology of China, in 2002, the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology, in 2004, and the Ph.D. degree in electrical and computer engineering from the University of Maryland, College Park, in 2009. He has been a Research Intern with Microsoft Live Labs Research. He is currently a Senior Research Scientist with Adobe Research, San Jose, CA. His research interests include deep learning, object detection and recognition, image classification and tagging, content-based image and video retrieval, human motion tracking, and activity analysis.

Xiaohui Shen (M'11) received the B.S. and M.S. degrees from the Department of Automation, Tsinghua University, China, and the Ph.D. degree from the Department of Electrical Engineering and Computer Sciences, Northwestern University, in 2013. He is currently a Research Scientist with Adobe Research, San Jose, CA. He is generally interested in research problems in the area of computer vision, in particular, image retrieval, object detection, and image understanding.

Brian Price received the Ph.D. degree in computer science from Brigham Young University under the advisement of Dr. B. Morse. He has contributed new features to many Adobe products, such as Photoshop, Photoshop Elements, and After Effects, mostly involving interactive image segmentation and matting. He is currently a Senior Research Scientist with Adobe Research, specializing in computer vision. His research interests include semantic segmentation, interactive object selection and matting, stereo and RGBD, and broad interest in computer vision and its intersections with machine learning and computer graphics.
