

Unsupervised Co-segmentation for Indefinite Number of Common Foreground Objects

Kunqian Li, Student Member, IEEE, Jiaojiao Zhang, and Wenbing Tao*, Member, IEEE

Abstract—Co-segmentation addresses the problem of simultaneously extracting the common targets that appear in multiple images. Object co-segmentation with multiple common targets, which is very common in reality, has recently become a new research hotspot. In this paper, an unsupervised object co-segmentation method for an indefinite number of common targets is proposed. The method overcomes the inherent limitation of traditional proposal selection based methods on images containing multiple common targets, while retaining their original advantages for object extraction. For each image, the proposed multi-search strategy extracts each target individually, and an adaptive decision criterion is proposed to give each candidate a reliable judgement automatically, i.e., "target" or "non-target". Comparison experiments conducted on the public datasets iCoseg and MSRC, as well as on the more challenging dataset Coseg-INCT, demonstrate the superior performance of the proposed method.

Index Terms—Co-segmentation, multi-object discovery, adaptive feature, loopy belief propagation.

I. Introduction

As the era of big data arrives, it places high demands on data processing ability. In the field of image segmentation, which is an important foundation of many higher-level computer vision applications, how to make full use of the consistent information shared by different images to assist the segmentation of every single image has become a new research hotspot. Image co-segmentation algorithms [1]–[5], which aim at extracting the common targets simultaneously, have emerged to meet this need. Among co-segmentation algorithms, segmentation proposal selection based approaches [5]–[7] have recently drawn increasing attention. These approaches aim at picking out the real common objects from a set of segmentation proposals, and the assumption that the common targets should be real-world objects makes co-segmentation much more efficient. However, these approaches still bear obvious defects. In this paper, we extend previous proposal selection based co-segmentation methods with the following three major contributions.

1) The key issue of proposal selection based co-segmentation lies in mining the consistent information shared by the common targets. Due to the uncertainty of the shared features, it usually requires manual feature selection [6] or feature learning performed beforehand [5], [7]. In this paper, a very simple but highly effective self-adaptive feature selection strategy is introduced.

The authors are with the National Key Laboratory of Science & Technology on Multi-spectral Information Processing, School of Automation, Huazhong University of Science and Technology, Wuhan 430074, China. *Corresponding author: [email protected], Tel. +86 27 87540164

2) Most existing proposal selection based image co-segmentation algorithms [5]–[7] assume that each image contains only a single common target. Co-segmentation is then formulated as a combinatorial optimization problem, i.e., selecting an optimal proposal for each image by maximizing the overall similarity score. However, such a subjective assumption limits their application: for images containing multiple common targets, these approaches usually fail to extract all the targets. In this paper, a proposal selection based unsupervised co-segmentation method for multiple common targets is introduced.

3) For images containing multiple common targets, an obvious alternative is to employ multi-class co-segmentation approaches. However, due to the significant appearance variance and the inconsistent number of common targets, the ideal class numbers for different images are usually not the same. Moreover, another defect of multi-class co-segmentation methods is that combinational common targets are usually split into multiple pieces. In this paper, we propose an adaptive strategy that can handle an indefinite number of common targets, where each image may contain a different number of common targets.

The rest of the paper is organized as follows. After a review of related works in Section II and a description of the mathematical model established for the targeted problem in Section III, we introduce the proposed co-segmentation method for images containing an indefinite number of common targets in Section IV. In Section V, comparison experiments conducted on the public datasets and the newly built dataset demonstrate the good performance of the proposed method. In Section VI, necessary discussion is given. Finally, concluding remarks are presented in Section VII.

II. Related work

Since image co-segmentation was first put forward by Rother et al. [1], it has received a lot of attention, and numerous researchers have provided many effective approaches. The key point of the co-segmentation problem is mining common information and imposing consistency constraints on the segmentation process. Foreground histogram consistency was the first explored type of common information [1], and the usual way is to incorporate the corresponding additional constraint energy into the MRF framework. Along this line, researchers put their effort into discovering other types of histogram consistency terms so as to


simplify the optimization [8]–[10] and extend the application [11] as well. However, color histogram consistency based approaches can only handle targets that share the same color distribution, which limits their application.

Later, investigators paid more attention to general co-segmentation problems, i.e., segmentation algorithms designed not only for the same object, but also for targets of the same category and even for multi-class common objects. Invariant features were considered and strategies from other perspectives were proposed. Kim et al. [4] proposed the first algorithm that handles the multi-class co-segmentation problem. They encoded co-segmentation as maximizing the heat gain of a heat diffusion system built on the whole image group, where the optimal positions of the heat sources correspond to the maximum heat gain. The final segmentation is achieved by solving a K-seed random walk problem, where the heat sources can be viewed as the seeds and K is the number of classes. This approach provides an innovative strategy for solving the co-segmentation problem. Joulin et al. [2] combined a spectral method and a kernel method within a discriminative framework to perform co-segmentation. In [3], they extended their method to multi-class co-segmentation by formulating an energy function with a probabilistic interpretation. In these two methods, invariant features like SIFT [12] are adopted to handle situations with significant color variance. As a multi-class co-segmentation method, [3] has the same defect as [4], i.e., the number of classes K must be manually set. Furthermore, if the real K is not uniform across images, a fixed setting is apparently unreasonable. In [13], Mukherjee et al. proposed to automatically extract multiple objects of multiple classes by analyzing their common subspace structure. As in [3] and [4], only pixel-wise low-level consistent information is exploited.

[14]–[16] belong to another kind of multi-class co-segmentation algorithm, designed for a more general co-segmentation problem, i.e., multi-class common objects are contained in the image collection but each target may not appear in every image. Kim et al. [14] designed an iterative scheme that combines foreground modeling and region assignment; this approach requires the class number K as input beforehand. [15] and [16] are both unsupervised co-segmentation approaches: [15] designed an ensemble clustering scheme to discover the implied objects, and [16] introduced multiple groups of consistent segmentation functions to establish partially consistent relationships across different images. An obvious drawback of such multi-class methods is that combinational targets are usually split into multiple pieces in the final segmentations; recombining those foreground pieces and picking out the desired segmentations is still very cumbersome.

With the increasing diversity of consistent information, feature selection has become another troublesome problem. Taking [2], [3] as examples, users must manually set the adopted feature (either color or SIFT) beforehand. In [17], Chai et al. combined GrabCut and SVM for bi-level segmentation, i.e., pixel-level segmentation with GrabCut for each individual image, and superpixel-level segmentation with SVM across different images. To



Fig. 1. Co-segmentation for images containing an indefinite number of common targets. (a) Source images from the Woman Soccer Players Red group of iCoseg [20]. Note that a different number of common targets (soccer players in red shirts) is involved in each image. (b) The top-ranked object proposals obtained with [21]. Co-segmentation turns into giving each proposal a label that denotes whether it is a common target; the expected common-target proposals are contained in red boxes. (c) The desired final segmentation, which contains all the common targets.

characterize the superpixels, multiple features such as color, SIFT, shape, size, and location are integrated into a 1076-D vector; however, the weights for the different features were not mentioned. [18], [19] also involve superpixels. In [18], Kim et al. proposed to construct a hierarchical graph covering the multi-scale superpixels of all images, turning image co-segmentation into a spectral decomposition problem. In [19], Rubio et al. incorporated region matching and a pixel-region consistency constraint into the traditional MRF framework. Both [18] and [19] use color and SIFT/SURF histograms to measure the similarity of superpixels, but the weights for the different terms are fixed, which is rarely flexible or accurate enough. Meng et al. [7] proposed a feature adaptive co-segmentation method, which aims at learning an appropriate feature combination from the images with low complexity. This is pioneering work in image co-segmentation; however, when all the images are of high complexity, this strategy usually fails to work well. In addition, when the common targets in the simple images are far from those of the other images in feature space, the learned feature loses its representativeness. Chang et al. [22] proposed to construct the data terms of the MRF with a co-saliency prior and designed a novel global constraint term, which results in a submodular energy function that can be efficiently optimized with graph cut. Collins et al. [23] proposed a random walker based co-segmentation algorithm instead of the traditional MRF framework, which leads to an easier and faster optimization. Rubinstein et al. [24] proposed to use dense correspondence across the whole database to capture the sparsity of common objects as


well as to distinguish the images where common objects are absent. Dai et al. [25] and Tao et al. [26] both proposed shape consistency based co-segmentation approaches, which perform well for objects with similar outlines. Besides, other auxiliary information, such as image depth, has also been applied to co-segmentation tasks to enhance target consistency [27].

As mentioned before, due to the ambiguity of content understanding and the inherent limitation of low-level cues, unsupervised image segmentation itself is a challenging task and a good partition is not easy to obtain. Object proposal/patch based high-level representation has been demonstrated to be a better choice for image and video segmentation/co-segmentation applications [5], [27]–[30]. Indeed, target proposal generation has recently become an active branch of image segmentation. These approaches aim at providing more accurate segmentation proposals within fewer hypotheses. Among the numerous proposal generation methods, object proposal generation approaches such as [31] and [21], whose goal is to extract only object-like targets, have drawn much attention. In [31], a group of segmentation proposals is generated by solving a sequence of constrained parametric graph-cut problems, and these hypotheses are then ranked based on mid-level region properties that reflect their probability of being real-world objects. Similarly, Endres et al. [21] combined boundary and shape cues as well as low-level information to generate category-independent object proposals, which are then ranked with trained classifiers. Note that, if we ignore the increase in computational effort, a satisfactory segmentation can always be achieved when there are plenty of hypotheses. However, the subsequent task of ranking these segmentation proposals according to their probability of being real targets becomes much more complicated; picking out the optimal one is an even more challenging task, and the results may be unreliable. If we combine it with the idea of co-segmentation, however, consistency information can serve as another basis for proposal ranking.

In [5], Vicente et al. first proposed to cooperatively select the optimal object proposals as the co-segmentation results. Whether an object proposal is appropriate as the foreground segmentation is closely related to its similarity to the chosen proposals of the other images, and the final selection is achieved by maximizing the sum of similarities over all selected proposal pairs. Meng et al. [6] simplified this strategy by constructing a directed graph, where the proposals of each image are arranged as columns and weighted edges exist only between the proposals of adjacent columns; the proposal selection results are then obtained by searching for the shortest path from an additional starting node to a virtual ending node. As previously reviewed, Meng et al. [7] extended [6] by creatively introducing an image complexity analysis based feature learning strategy. Although it can learn an accurate feature combination from the simple images, the representativeness of those simple images is still one-sided when compared with the whole image group. Besides, as discussed above, [5], [6] and [7] are all based on the assumption that every image contains only a single target,


which is very limited. In the field of video co-segmentation, Fu et al. [32] proposed a proposal selection based multiple-foreground video co-segmentation method, but it can only handle a fixed number of targets, and their method cannot be directly applied to image collection segmentation, where little inter-frame information is available.

In this paper, we propose an object proposal selection based multiple common targets co-segmentation method for image collections, which can flexibly handle problems involving different numbers of common objects. Moreover, a simple but effective feature selection method is proposed. In the following sections, we give a more detailed description of the proposed method.

III. Problem formulation

As presented in Fig. 1, given an image set $I = \{I_i\}$, $i = 1, \ldots, M$, where the images may contain different numbers of targets, our goal is to extract all the common targets. For every single image $I_i$, we first generate the object proposal set $P_i = \{p_i^k\}$, $k = 1, \ldots, K_i$, and set a relatively large value for $K$ to make sure that the object proposal set covers all the potential common targets. The co-segmentation problem for an indefinite number of common targets can then be transformed into a labeling problem, i.e., give each $p_i^k$ a binary assignment $x_i^k$, where $x_i^k = 1$ means $p_i^k$ is determined to be a foreground common target, and $x_i^k = 0$ means the opposite. The union of the selected foreground proposals is the final segmentation result, i.e.,

$$R_i = \bigcup \left\{ p_i^k \mid x_i^k = 1,\ k \le K_i \right\}. \qquad (1)$$

In our approach, we formulate this co-segmentation task for an indefinite number of common targets as a labeling problem in a completely connected network, where each object proposal corresponds to a node and the nodes are connected by weighted edges. The multiple-proposal selection of each image is conducted separately, but it is closely related to the other images of the collection. For each image $I_i$, we choose one proposal in every selection round to be a real common foreground, and before conducting the next round, we remove the node of the last chosen proposal so that a new proposal will be considered as a target. Whether the newly selected one is accepted as a real target depends entirely on the labels of the other images. Therefore, the segmentation problem of image $I_i$ finally becomes finding an optimal labeling set $x_i = \{x_i^k \mid k = 1, \ldots, K_i;\ x_i^k \in \{0, 1\}\}$ by maximizing the following energy function,

$$E_i(x) = \sum_{n=1}^{N_i} \Bigg( \sum_{\substack{j \ne i \\ x_i^u \& x_j^v = 1}} W\!\left(x_{i,n}^u, x_{j,n}^v\right) + \sum_{\substack{j,h \ne i \\ x_j^v \& x_h^w = 1}} W\!\left(x_{j,n}^v, x_{h,n}^w\right) \Bigg)$$

$$\text{s.t.}\quad N_{l,n} = \sum_{k=1}^{K_l} x_{l,n}^k = 1,\quad u \le K_i,\ v \le K_j,\ w \le K_h,\quad \sum_{n=1}^{N_i} x_{i,n}^k \le 1,\quad \#\Big\{ \bigcup_{j \ne i} x_{j,n} \oplus \bigcup_{j \ne i} x_{j,1} \Big\} \le T. \qquad (2)$$

$W(x_i^u, x_j^v)$ is the weight of the edge connecting the proposal pair $(p_i^u, p_j^v)$ in images $I_i$ and $I_j$, which represents their similarity as well as the confidence of selecting both as foreground targets. Note that $W(x_i^u, x_j^v)$ is non-zero only when $x_i^u \& x_j^v = 1$, in which case it numerically equals the similarity $S(p_i^u, p_j^v)$, which will be introduced in Section IV-C.
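As a minimal illustration of this labeling formulation (the paper's own implementation is in MATLAB; the Python sketch below is ours and purely illustrative), the final segmentation of Eq. (1) is simply the union of the proposal masks labeled as foreground:

```python
import numpy as np

def final_segmentation(proposal_masks, labels):
    """Eq. (1): the final region R_i is the union of the proposals
    labeled as common targets.

    proposal_masks : list of HxW boolean arrays (the K_i proposals of I_i)
    labels         : list of 0/1 assignments x_i^k
    """
    region = np.zeros_like(proposal_masks[0], dtype=bool)
    for mask, x in zip(proposal_masks, labels):
        if x == 1:            # proposal judged to be a common target
            region |= mask    # accumulate into the foreground union
    return region
```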



Fig. 2. The overall framework of the proposed co-segmentation approach for images containing multiple common objects. (a) Source images from the Woman Soccer Players Red group of iCoseg, where different numbers of common targets (players in red shirts) are involved. (b) Target proposal set generation for each image. (c) Construction of the complete weighted graph, where nodes of different colors represent the proposals of different images and the thickness of an edge corresponds to the similarity between the two connected proposals. For clarity, only part of the edges are shown: edges across different images are colored black, while edges within each image use that image's own color. (d) The multi-search process. The multiple targets of each image are individually extracted during each search, and an adaptive termination condition is proposed. (e) The final segmentation of each image.

$x_{i,n}$ denotes the labels of the proposals in the $n$-th selection round of $I_i$. $N_i$ is the total number of selected proposals, and it is automatically determined by the adaptive label-change threshold $T$ introduced in Section IV-D. $N_{l,n} = 1$ means that for each image $I_l$ only one proposal can be selected per round, and $\sum_{n=1}^{N_i} x_{i,n}^k \le 1$ means that every proposal in image $I_i$ can be selected only once throughout the selection procedure.

The formulation in Eq. (2) is based on the observation that the common targets usually share the same characteristics; by maximizing the overall similarity under the additional constraints, we ensure that the newly chosen object proposal in each search round of $I_i$ is always the one most similar to the chosen proposals of the other images. The optimization problem in Eq. (2) can be solved with a greedy approach, i.e., repeatedly maximizing the sub-function

$$E_{i,n}(x_n) = \sum_{\substack{j \ne i \\ x_i^u \& x_j^v = 1}} W\!\left(x_{i,n}^u, x_{j,n}^v\right) + \sum_{\substack{j,h \ne i \\ x_j^v \& x_h^w = 1}} W\!\left(x_{j,n}^v, x_{h,n}^w\right) \quad \text{s.t.}\quad N_{l,n} = \sum_{k=1}^{K_l} x_{l,n}^k = 1,\ l = 1, \ldots, M, \qquad (3)$$

and removing the chosen proposal $p_{i,n}^u$ of the last round, until the labeling dissatisfies

$$\#\Big\{ \bigcup_{j \ne i} x_{j,n} \oplus \bigcup_{j \ne i} x_{j,1} \Big\} \le T, \qquad (4)$$

where $\bigcup_{j \ne i} x_{j,n}$ is the label set of the images other than $I_i$. We compare $\bigcup_{j \ne i} x_{j,n}$ with the labels of the first round, $\bigcup_{j \ne i} x_{j,1}$; only when the number of changed labels is less than the given threshold $T$ is the newly chosen proposal regarded as a real target. This termination condition therefore plays an important role in target/non-target judging. The sub-problem encoded in (3) can be solved with an exact A*-search algorithm [33], [34], or with loopy belief propagation for approximation when the number of images is large.
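To make the max-sum objective of (3) concrete, the sketch below selects one proposal per image by iterated coordinate ascent. This is our simplified stand-in, not the A*-search or loopy belief propagation solvers used in the paper; the pairwise similarity matrices `W` are assumed to be precomputed:

```python
import numpy as np

def select_one_per_image(W, n_iters=20):
    """Approximate solver for sub-problem (3): pick exactly one proposal
    per image so that the sum of pairwise similarities is (locally) maximal.

    W[i][j] : (K_i x K_j) array of similarities between the proposals of
              images I_i and I_j (entries for i == j are ignored).
    Returns sel, where sel[i] is the chosen proposal index of image I_i.
    """
    M = len(W)
    sel = [0] * M                                  # arbitrary initialization
    for _ in range(n_iters):
        changed = False
        for i in range(M):
            # score every candidate of I_i against the current picks elsewhere
            scores = sum(W[i][j][:, sel[j]] for j in range(M) if j != i)
            best = int(np.argmax(scores))
            if best != sel[i]:
                sel[i], changed = best, True
        if not changed:                            # reached a local maximum
            break
    return sel
```

Unlike the exact A*-search, this coordinate ascent only guarantees a local maximum, but it conveys the structure of the problem: each image contributes exactly one node, and the objective is the sum of edge weights among the chosen nodes.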

IV. Co-segmentation for an indefinite number of common targets

As discussed above, the key point of co-segmentation with an indefinite number of common targets lies in adaptively determining the number of targets, which requires fully extracting the potential targets and then mining the consistent relationships shared by the common targets. Extracting the potential targets means discovering all the implied objects, and mining the consistent relationships is required to determine the common targets of interest that reappear in every image. In this section, we give a detailed description of our co-segmentation method.

A. Overall framework

The flow chart of the proposed method is presented in Fig. 2. First, target candidates are generated with the category-independent object proposal generation method of [21]. Then, as discussed in Section III, a completely connected graph linking all the generated proposals is constructed, where the edge weights correspond to the proposal similarities. Here,


Algorithm 1: Adaptive feature weight selecting algorithm

Input: Proposal sets P = {P_1, P_2, ..., P_M}.
Output: Feature relative weight α, initial label assignments x.

1: Initialization: α = 0.5, α_prev = 0.
2: while α ≠ α_prev do
3:   Weighted graph construction with α.
4:   Update x by solving the sub-problem in (3).
5:   α_prev ← α; α = argmax_{0 ≤ α ≤ 1} Σ_{x_i^u = x_j^v = 1} W(x_i^u, x_j^v, α) − λ·σ²(x, α).
6: end while

Fig. 3. Image groups with different common features, together with the expected feature weight bars for color and BoF. (a) Kendo group from iCoseg [20]. (b) Flowers group from MSRC [35].

to establish a reliable similarity measurement for proposals with different features in common, we introduce an adaptive feature weight selecting algorithm. The final module performs multiple common target searching, where the multiple common targets of each image are individually extracted. Besides, an adaptive search termination condition is designed as the common-target judging criterion. When the multi-search process terminates, we obtain the final segmentation by collecting all the selected proposals. In the following subsections, we present more detailed descriptions of the proposed co-segmentation method.

B. Object proposals generation

The quality of the object proposal pool directly impacts the performance of proposal selection based co-segmentation, and the proposed multi-search strategy places even higher requirements on it. The quality of a proposal pool is usually measured in two aspects: diversity and representativeness. Diversity means the proposal pool should cover as many objects as possible, and representativeness means it should contain as few candidates as possible for each object. Encouraging diversity aims at avoiding the omission of interesting targets, while representativeness is especially required by our multi-search strategy, whose aim is to avoid repeatedly extracting the same target as well as to reduce the storage and computation burden.

In our approach, we adopt [21] to obtain the object proposal pool for each image. In this category-independent object proposal generation method, after a large number of proposals are generated, a scoring mechanism combining appearance features and an overlap penalty is used for proposal ranking. The proposals corresponding to different potential objects with accurate object segmentations are finally top-ranked, which aligns well with the above-mentioned requirements on the object proposal pool. However, like most other proposal generation methods, a major flaw of [21] is that, for complex combinational targets, each proposal usually contains only a local part of them. Apparently, directly adopting traditional proposal selection based co-segmentation methods, such as [5], [6], will lead to partial cutouts. The proposed method makes up for such loss by conducting multiple target searches.

C. Weighted graph construction

The kernel of co-segmentation problems lies in mining the consistent information implied in the common targets. Specifically, for proposal selection based co-segmentation approaches, the usual way is to measure the similarity between every two proposals. Considering that the shared similar features are usually unknown to the algorithm beforehand, choosing fixed features for similarity measurement is usually not a good option. Therefore, adopting a flexible and reliable proposal similarity measurement has become a key aspect of co-segmentation. In this section, an unsupervised self-adaptive similarity measurement, which is highly efficient and easy to implement, is introduced for calculating the edge weights of the graph. In our approach, the similarity between two proposals is defined as follows,

$$S\!\left(p_i^u, p_j^v\right) = \alpha \cdot S\!\left(h_{i,c}^u, h_{j,c}^v\right) + (1 - \alpha) \cdot S\!\left(h_{i,s}^u, h_{j,s}^v\right), \qquad (5)$$

where $h_{i,c}^u$ and $h_{i,s}^u$ are the normalized color and Bag of Features (BoF) [36] histograms of proposal $p_i^u$, respectively. The histogram similarity is defined as follows,

$$S\!\left(h_p, h_q\right) = \sum_{b=1}^{\mathrm{length}(h)} \min\!\left(h_p(b), h_q(b)\right). \qquad (6)$$
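A minimal sketch of Eqs. (5)-(6) follows, assuming the color and BoF histograms have already been extracted and normalized (histogram construction itself follows [36] and is not reproduced here):

```python
import numpy as np

def hist_intersection(hp, hq):
    """Eq. (6): histogram intersection of two normalized histograms."""
    return np.minimum(hp, hq).sum()

def proposal_similarity(h_color_u, h_bof_u, h_color_v, h_bof_v, alpha):
    """Eq. (5): alpha-weighted combination of color and BoF similarities."""
    return (alpha * hist_intersection(h_color_u, h_color_v)
            + (1.0 - alpha) * hist_intersection(h_bof_u, h_bof_v))
```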

$\alpha$ ($0 \le \alpha \le 1$) is the weight coefficient balancing the two histogram similarity terms. For different image groups, the appropriate value of $\alpha$ is usually not fixed. Taking the two image groups presented in Fig. 3 as examples, the common targets in the kendo images share a common color appearance, while the color appearance of the flowers changes significantly; therefore, the expected feature weights for these two image groups are greatly different. In order to add adaptivity and flexibility, we take the idea from [7] and design an iterative weight setting mechanism for the features in Eq. (5). The iterative algorithm is summarized in Algorithm 1. $\alpha$ is first initialized to 0.5 to give the two features equal importance. Then, according to Eq. (5), we construct a completely connected graph by linking every two proposals with an edge weighted by their similarity. By solving the sub-problem described in (3) with loopy belief propagation, we obtain the initial label assignments x. We then update $\alpha$ by maximizing the following objective function with the initial proposal labels x,

$$\alpha = \arg\max_{0 \le \alpha \le 1} \sum_{x_i^u = x_j^v = 1} W\!\left(x_i^u, x_j^v, \alpha\right) - \lambda \cdot \sigma^2(x, \alpha), \qquad (7)$$

where $\lambda$ is a weight that is empirically set to 100 in our experiments, and $\sigma^2(x, \alpha)$ is the variance of the similarities between the currently selected proposals, which can be calculated as

$$\sigma^2(x, \alpha) = \sum_{x_i^u = x_j^v = 1} \left( W\!\left(x_i^u, x_j^v, \alpha\right) - \overline{W} \right)^2. \qquad (8)$$

$\overline{W}$ is the mean value of all the similarities between the selected proposals. The intuitive intention of designing such an objective function is to encourage the selected common targets to be globally consistent while keeping a low variance, which makes the similarity metric more reasonable and representative. After obtaining the new feature coefficient $\alpha$, we go back to reconstruct the weighted graph and retrieve a new label assignment. We repeat this iterative loop until $\alpha$ no longer changes. Such an adaptive feature weight selecting method greatly extends the application range of the co-segmentation approach, and frees users from the tedious manual feature selection procedure.
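For fixed labels, Eq. (7) is one-dimensional in $\alpha$, so a simple grid search suffices. The sketch below illustrates the alternation of Algorithm 1 under stated assumptions: `solve_subproblem` and `pair_features` are hypothetical placeholders (not the paper's API) standing for solving (3) and for fetching the per-feature similarities of the selected pairs:

```python
import numpy as np

def objective(alpha, color_sims, bof_sims, lam=100.0):
    """Eq. (7) objective for fixed labels: total combined similarity of the
    selected pairs minus lambda times Eq. (8), the sum of squared
    deviations from the mean similarity."""
    w = alpha * color_sims + (1.0 - alpha) * bof_sims
    return w.sum() - lam * ((w - w.mean()) ** 2).sum()

def adapt_alpha(pair_features, solve_subproblem, tol=1e-3):
    """Algorithm 1 sketch: alternate label updates and a 1-D search on alpha.

    solve_subproblem(alpha) -- placeholder for solving (3); returns the
        selected proposal pairs (an assumed interface).
    pair_features(pairs)    -- placeholder returning two arrays: the color
        and BoF similarities of every selected pair (also assumed).
    """
    alpha, prev = 0.5, None                        # equal importance at start
    while prev is None or abs(alpha - prev) > tol:
        pairs = solve_subproblem(alpha)            # update labels x
        color_sims, bof_sims = pair_features(pairs)
        grid = np.linspace(0.0, 1.0, 101)          # exhaustive 1-D grid search
        scores = [objective(a, color_sims, bof_sims) for a in grid]
        prev, alpha = alpha, float(grid[int(np.argmax(scores))])
    return alpha
```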


D. Common targets multi-search strategy

As discussed above, it is usually difficult for traditional proposal based co-segmentation methods to extract all the common targets implied in images such as those in Fig. 1 and Fig. 3. In this section, we introduce an adaptive common target searching strategy that can deal with any number of targets. To discover as many common targets as possible in each image $I_i$, we repeatedly search for more common targets by removing the previously discovered ones from the candidate pool $P_i$. To determine whether a newly selected proposal is a real common target, a decision criterion that depends on the labels of the other images is proposed: we set an adaptive tolerance upper limit $T$ on the number of differing labels between the initial labels x and the current labels x', and the search stops when the label difference exceeds this limit.

Algorithm 2: Common targets multi-search algorithm

Input: Proposal sets P = {P_1, P_2, ..., P_M}, feature relative weight α, and initial label assignments x = {x_1, ..., x_M} achieved with Algorithm 1.
Output: Final label assignments x* = {x*_1, ..., x*_M}.

1: Initialization: x* = x, which is achieved with Algorithm 1.
2: for all I_i ∈ I do
3:   Remove p_i^k from P_i if x_i^k = 1.
4:   while P_i ≠ ∅ do
5:     Complete weighted graph construction with α.
6:     Achieve current x' by solving the sub-problem in (3).
7:     Label changes counter L = #{∪_{j≠i} x'_j ⊕ ∪_{j≠i} x_j}.
8:     if L ≤ T then
9:       Update x*_i by setting x_i^{k'} = 1.
10:      Remove p_i^{k'} from P_i.
11:    else
12:      P_i = ∅.
13:    end if
14:  end while
15:  Before processing the next image, reload P_i.
16: end for

Fig. 4. The adaptive termination condition for the common targets multi-search. (a) The 3rd search results (the nodes connected by black edges) of I_1 in Fig. 2: the proposal represented by the blue node is viewed as a real common target, since the co-findings (red and green nodes) remain unchanged compared with the results of the first search (the nodes connected by red edges). (b) The 4th search results of I_1: the process terminates, since the co-findings have changed considerably compared with the results of the first search.
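The outer loop of Algorithm 2 for one image can be sketched as follows. This is our illustrative rendering; `solve_subproblem` is again a hypothetical placeholder for solving (3) over the currently active proposals of all images:

```python
def multi_search_image(i, K_i, solve_subproblem, T):
    """Algorithm 2 sketch for a single image I_i.

    solve_subproblem(active) -- placeholder for solving (3) with the active
        proposal pool of I_i; returns a dict mapping each image index to
        its selected proposal index (an assumed interface).
    Returns the proposal indices of I_i accepted as common targets.
    """
    active = set(range(K_i))          # candidate pool P_i
    reference = None                  # co-findings of the first search
    targets = []
    while active:
        sel = solve_subproblem(active)           # labels of this search round
        others = {j: k for j, k in sel.items() if j != i}
        if reference is None:
            reference = others                   # baseline labels for Eq. (4)
        changes = sum(reference[j] != others[j] for j in others)
        if changes > T:                          # decision criterion violated
            break                                # non-target found: terminate
        targets.append(sel[i])                   # accept the new target
        active.discard(sel[i])                   # remove it before next round
    return targets
```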

The proposed common targets multi-search algorithm is summarized in Algorithm 2. In step 1, we initialize the label assignment x* and the feature weight coefficient α with the results of Algorithm 1. Steps 3 to 14 present the whole multi-search process for each image $I_i$. Before every search operation, we remove the proposals discovered in previous rounds from the target candidate set $P_i$ to make sure they will not be picked again, so that a new proposal is considered as a potential common target. Then, in steps 5 and 6, we reconstruct the completely connected graph and re-solve the sub-problem in (3) to obtain the new assignment x'. Since the previously discovered targets have been removed, the newly discovered proposal $\{p_i^{k'} \mid x_i^{k'} = 1\}$ can be viewed as the one currently most likely to be a common target. As stated in steps 7 and 8, whether $p_i^{k'}$ is a real common target depends on the degree of label change in the other images. We assume that if $p_i^{k'}$ is a real common target, it should be well consistent in feature space with the most likely targets of the other images, which shows up as them being simultaneously selected as targets. If the number of changed labels $L$ is less than the given adaptive threshold $T$, we update the label assignment x* by setting $x_i^{k'} = 1$, as presented in step 9. Here, the adaptive threshold $T$ is calculated as

$$T = M \cdot \left(1 - \tau e^{-\sigma(x)}\right), \qquad (9)$$

where

$$\sigma(x) = \frac{1}{M^2} \sum_{x_{i,1}^u = x_{j,1}^v = 1} \left( \frac{W\!\left(x_{i,1}^u, x_{j,1}^v\right) - \overline{W}}{W_{\max} - W_{\min}} \right)^2 \qquad (10)$$

is the normalized variance of the similarities between the initially selected proposals; $\overline{W}$, $W_{\max}$ and $W_{\min}$ correspond to three statistics of those similarities, i.e., their mean, maximum and minimum. $\tau$ is an $M$-related coefficient that serves as compensation for the declining influence of a single image as $M$ increases; here $\tau = e^{M/1000}$.
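Eqs. (9)-(10) translate directly into a few lines; the sketch below assumes the pairwise similarities of the first search are collected into one array (and, as a sketch, does not guard the degenerate case where all similarities are equal):

```python
import numpy as np

def adaptive_threshold(first_pair_sims, M):
    """Eqs. (9)-(10): label-change tolerance T from the first search.

    first_pair_sims : 1-D array of similarities W between the proposals
                      selected in the first search (all selected pairs).
    M               : number of images in the group.
    """
    w = np.asarray(first_pair_sims, dtype=float)
    w_range = w.max() - w.min()                    # W_max - W_min
    # Eq. (10): normalized variance of the first-search similarities
    sigma = (((w - w.mean()) / w_range) ** 2).sum() / M ** 2
    tau = np.exp(M / 1000.0)                       # compensation, tau = e^{M/1000}
    return M * (1.0 - tau * np.exp(-sigma))        # Eq. (9)
```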


In practice, we find that a relatively large threshold $T$ is usually good enough for the proposed searching strategy to tell target and non-target regions apart; the real influence of $T$ lies in judging ambiguous proposals where targets and ambiguous backgrounds stick together. The insight behind this design is that when the similarities between the proposals selected in the first search show a relatively large variance, it usually means that the common targets implied in this group are not very consistent in feature space, so a relatively larger threshold $T$ should be used so as to extract more common targets. Conversely, if the similarities between the selected proposals show a relatively small variance, it usually means the relations between the selected targets are very consistent and robust; in this case, to avoid bringing in non-target proposals, a relatively smaller $T$ is more reasonable. After removing $p_i^{k'}$ from $P_i$, we return to step 5 to perform the next search if $P_i$ is still non-empty. Once a newly selected proposal fails the decision criterion in step 8, the search stops and the algorithm turns to the next image. In Fig. 4, the example from Fig. 2 is presented to visualize the adaptive termination condition for the common targets multi-search: the search ends when the co-findings in the other images (the red and green nodes in the graph) have changed considerably.

V. Experiments

To evaluate the performance of the proposed method, comparison experiments are first conducted on two publicly available standard datasets, iCoseg [20] and MSRC [35], which are widely used by previous co-segmentation approaches to evaluate segmentation performance. They involve considerable changes in viewpoint, outline and illumination; in some of the more challenging image groups, the common targets are only partly captured or different numbers of targets are involved. Besides, to further demonstrate the performance of the proposed method on images with an Indefinite Number of Common Targets, we build a new challenging dataset named Coseg-INCT by greatly expanding the Coseg-Rep [25] dataset. Exhaustive comparison and verification experiments are conducted on this challenging dataset.

Since the proposed method aims at solving co-segmentation problems involving complex combinational targets and multiple common targets, we divide the comparison experiments into two parts for clarity, i.e., co-segmentation on images with a single common target (Sec. V-A) and co-segmentation on images with multiple common targets (Sec. V-B). Finally, in Sec. V-C, to test the effectiveness of the multiple searching and the adaptive feature weight setting mechanism, we present contrast experiments on these two factors respectively.

A. Co-seg on single common target involved images

As previously mentioned, complex targets, especially combinational targets, are difficult to extract completely with the traditional proposal selection based approaches. In this section, we first conduct experiments on image groups from iCoseg and MSRC that contain a single common target. The proposal selection based methods [5], [6] and some other co-segmentation approaches [2]–[4], [25] are selected for comparison. The quantitative results are presented in Table I. Since there is no unified evaluation measurement, we adopt different measurements for iCoseg (error rate) and MSRC (Intersection-over-Union score) so as to use as many publicly available evaluation results as possible for fair comparison.

Fig. 5. Segmentations for images with a single combinational common target. Source images of Stonehenge2 and Kite Panda are presented in Rows 1 and 4; segmentation results of our method with single searching are displayed in Rows 2 and 5; segmentation results of our method with multiple searching are presented in Rows 3 and 6. Segmentation becomes more complete with the proposed multiple searching strategy.

TABLE I
Segmentation performance comparison between our approach and the existing methods in terms of error rate (%, iCoseg) or IoU score (MSRC). Image groups from iCoseg and MSRC with a single common target are considered.

iCoseg (error %):

Group          Ours    [5]     [6]     [2]     [3]     [4]     [25]
Alaskan        8.36    10.00   11.97   23.69   23.19   18.25   33.67
Stonehenge     2.42    36.70   13.96   8.29    10.10   9.49    23.50
Stonehenge2    12.87   11.20   35.99   16.90   21.84   23.21   11.85
Ferrari        8.28    10.10   8.52    35.06   34.35   19.36   10.69
Taj Mahal      12.01   8.90    4.49    28.16   31.90   23.03   28.70
Pyramids       10.94   39.49   18.70   42.73   43.25   41.71   31.02
Panda          23.09   7.30    21.58   13.13   9.85    36.32   20.75
Helicopter     1.21    2.73    0.86    5.06    47.13   47.13   3.93
Airshows       1.53    4.15    0.46    8.32    8.10    13.46   14.54
Cheetah        12.18   20.49   12.75   33.52   41.00   32.18   9.56
Gymnastics     4.99    8.30    3.88    86.14   43.01   7.42    2.65
Skating        1.13    2.77    2.62    37.14   41.84   3.16    54.33
Balloon        5.08    9.90    8.78    20.25   28.76   57.54   16.22
Statue         6.32    6.20    10.46   4.07    4.77    13.05   7.60
Christ         10.69   16.84   13.11   5.14    6.41    18.74   12.95
Windmill       8.95    37.96   10.98   66.80   48.15   20.62   38.68
Kite           3.31    9.70    4.90    18.66   13.77   57.23   31.37
Kite Panda     6.10    9.80    18.82   21.88   18.37   15.73   18.90
Average        7.75    14.03   11.27   26.39   26.43   25.42   20.61

MSRC (IoU):

Group          Ours    [16]    [6]     [2]     [3]     [4]     [25]
Plane          0.48    0.52    0.46    0.22    0.33    0.25    0.47
Car            0.63    0.73    0.47    0.59    0.60    0.37    0.64
Cat            0.58    0.66    0.57    0.30    0.32    0.24    0.61
Dog            0.64    0.56    0.50    0.41    0.42    0.33    0.64
Cow            0.73    0.68    0.69    0.45    0.53    0.34    0.70
Average        0.612   0.630   0.538   0.394   0.440   0.306   0.612
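For reference, the two measurements can be computed from binary masks as follows. This is a generic sketch; dataset loading and the exact per-group averaging protocols of the cited works are not reproduced:

```python
import numpy as np

def error_rate(pred, gt):
    """Percentage of wrongly labeled pixels (the measurement used on iCoseg)."""
    return 100.0 * np.mean(pred.astype(bool) != gt.astype(bool))

def iou_score(pred, gt):
    """Intersection-over-Union score (the measurement used on MSRC)."""
    p, g = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union else 1.0
```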

involved. Besides, to further demonstrate the performance of the proposed method on the images with Indefinite Number of Common Targets, we build a new challenging dataset named Coseg-INCT by greatly expanding the Coseg-Rep [25] dataset. Exhaustive comparisons and verification experiments are conducted on this challenging dataset. Since the proposed method aims at solving complex combination targets and multiple common targets involved cosegmentation problems, we divide the comparison experiments into two parts for clarity, i.e., co-segmentation on single common target involved images (Sec. V-A) and co-segmentation on multiple common targets involved images (Sec. V-B). Finally, in Sec. V-C to test the effectiveness of multiple searching and adaptive feature weight setting mechanism, we will present contrast experiments on these two factors respectively. A. Co-seg on single common target involved images As previously mentioned, for complex targets especially the combinational targets, they are difficult to be completely extracted by the traditional proposal selection based approaches. In this section, we first conduct experiments on single common target involved image groups from iCoseg and MSRC. Proposal selection based methods [5], [6] and some other cosegmentation approaches [2]–[4], [25] are selected for comparison. In Table I, the quantitative results are presented. Since there is no unified evaluation measurement, we adopt different measurements for iCoseg (error rate) and MSRC (Intersectionover-Union score) to use as many public available evaluation


TABLE II
Segmentation performance comparison between the proposed co-segmentation algorithm and the existing methods in terms of error rate (%, iCoseg) or IoU score (MSRC). Image groups from iCoseg and MSRC with an indefinite number of common targets are considered.

iCoseg (error %):

Group           Ours    [5]     [6]     [2]     [3]     [4]     [25]
Red Sox         2.36    9.10    4.01    62.53   38.67   7.61    46.29
Liverpool FC    6.74    12.50   15.33   47.20   22.21   7.71    15.89
Elephants       7.05    56.90   9.40    34.07   31.26   17.26   36.62
Monks           13.41   9.22    4.75    51.93   52.30   9.31    12.82
Speed Skating   2.82    5.32    16.69   89.62   38.50   47.17   41.27
Women Skating   16.35   22.50   22.97   17.79   19.82   37.17   25.16
Women Soccer    7.08    12.33   8.42    75.13   41.60   9.59    10.81
Kendo           6.62    14.17   8.58    4.34    3.61    9.97    60.57
Average         7.80    17.76   11.27   47.83   31.00   18.22   31.18

MSRC (IoU):

Group           Ours    [16]    [6]     [2]     [3]     [4]     [25]
Sheep           0.75    0.72    0.70    0.60    0.66    0.61    0.75
Flowers         0.69    0.67    0.64    0.51    0.52    0.40    0.69
Bird            0.52    0.56    0.58    0.33    0.48    0.30    0.51
Average         0.653   0.650   0.640   0.480   0.553   0.437   0.650

TABLE III
Segmentation performance comparison between our method and the state-of-the-art multi-class approach [15] in terms of IoU score.

Group           [15]    Ours
Red Sox         0.73    0.77
Liverpool       0.53    0.61
Elephant        0.60    0.70
Monks           0.63    0.65
Speed           0.52    0.73
W Skating       0.71    0.52
W Soccer        0.63    0.68
Kendo           0.87    0.82
Average         0.6525  0.6850

Fig. 6. Segmentations for images with an indefinite number of common targets; examples from Red Sox, Liverpool FC and Kendo of iCoseg, and Bird and Flower of MSRC, are presented. In each subfigure, the original images are presented in the first row; the segmentation results of our method with single searching are displayed in the second row; the segmentation results of our method with multiple searching are presented at the bottom.

As can be noticed from Fig. 6, the images usually contain different numbers of common targets, and significant scale, posture and appearance variations are involved. To make a fair comparison, we follow the experimental settings discussed in Section V-A. The quantitative results on those collections are displayed in Table II. For most groups, our approach achieves the best performance; for the groups where our method does not perform best, our results are still competitive and very close to the best. Fig. 6 shows some segmentation examples that demonstrate the good adaptability of our approach to images with multiple common targets. We also compare our method with a state-of-the-art multi-class co-segmentation approach on the iCoseg dataset (multi-target groups). In Table III, we give the quantitative comparison results in terms of IoU score (for [15], only the IoU score is available). Among all 8 image groups, our method achieves better performance in 6.

To further test the performance of our method on an Indefinite Number of Common Targets, a new dataset named Coseg-INCT is built by expanding the Coseg-Rep [25] dataset. Coseg-Rep was originally designed for co-segmenting targets with similar outlines; some groups contain an indefinite number of common targets, and the backgrounds are usually more complicated than in iCoseg and MSRC. In order to further increase the challenge of the dataset, more images are provided for the corresponding groups; most of them contain multiple common targets, and the targets usually vary a lot in appearance. Some examples are presented in Fig. 7.


Fig. 7. Segmentation examples by the proposed method and [6] on the Coseg-INCT dataset. The images are from Deer, Dog, Cormorant, Egret, Geranium and Hockey, respectively. In each group, the original images are presented in Row 1; the segmentations by [6] are presented in Row 2; the segmentations by [6] with the same proposals as ours are presented in Row 3; the segmentations by the proposed method are presented in Row 4.

TABLE IV
Segmentation performance comparison between our co-segmentation algorithm and another proposal selection based method [6] on Coseg-INCT. Error rate (%) and the number of extracted targets (extracted/all) are considered. * means the comparison method uses the same proposal generation algorithm as the proposed method.

Group            [6]              [6]*             Ours
Chrysanth        12.68 (36/75)    10.86 (49/75)    9.42 (55/75)
Cormorant        8.54 (47/75)     6.23 (52/75)     7.67 (60/75)
Cow              9.16 (43/51)     8.19 (31/51)     8.59 (46/51)
Deer             7.84 (39/52)     7.45 (35/52)     10.35 (44/52)
Dog              8.30 (28/45)     8.47 (31/45)     4.44 (40/45)
Egret            7.05 (31/63)     6.80 (32/63)     6.28 (45/63)
Firepink         4.19 (31/39)     5.21 (31/39)     6.55 (38/39)
Geranium         10.71 (30/60)    9.23 (38/60)     5.37 (48/60)
Hockey           14.28 (30/49)    11.56 (31/49)    8.02 (43/49)
Rabbit           7.00 (31/46)     8.90 (33/46)     5.62 (37/46)
Seagull          11.91 (27/64)    11.99 (28/64)    11.04 (40/64)
Whitecampion     7.89 (27/52)     8.16 (31/52)     3.69 (49/52)
Average/Overall  9.13 (400/671)   8.59 (422/671)   7.25 (545/671)

In total, Coseg-INCT contains 12 categories and 291 images, in which 671 targets are involved (about 2.3 targets per image). We compare our method with the traditional proposal selection based co-segmentation approach [6] on this dataset; the quantitative results and segmentation examples are presented in Table IV and Fig. 7. To give a fairer comparison, we also evaluate [6] with the same proposal generation algorithm as ours. Besides, for [6] and [6]*, the feature must be manually chosen as shape or color; here, we test both feature settings and report the better one. It can be noticed from Table IV that for most of the groups, the error rates of our method are lower than those of [6]. Moreover, when considering the number of extracted common objects, the advantage of the proposed method is even more attractive: it extracts the most targets in all groups. For Cow and Firepink, the relatively lower error rates of [6] are owing to its use of saliency, which is more effective information for these two groups. As for Cormorant and Deer, whose backgrounds are very complicated, our error rates rise slightly because some ambiguous regions are wrongly labeled when our method tries to extract more targets; even so, our results for these groups are still very competitive.


TABLE V
Contrast experiments with different searching thresholds on iCoseg and Coseg-INCT. Error rate (%) and the number of extracted targets (extracted/all) are considered; the latter is only available for iCoseg-Multi and Coseg-INCT. AT is short for Adaptive Threshold.

Dataset         Multiple Search & AT   Single Search
iCoseg-Single   7.75                   9.06
iCoseg-Multi    7.80 (381/441)         9.21 (292/441)
Coseg-INCT      7.25 (545/671)         9.01 (387/671)

Dataset         θ = 0.4       θ = 0.3      θ = 0.2      θ = 0.1
iCoseg-Single   8.36          8.30         8.37         8.54
iCoseg-Multi    12.43/391     10.03/388    8.92/386     7.92/341
Coseg-INCT      11.11/589     8.89/575     7.45/548     7.32/489

C. Contrast experiments on multiple searching and adaptive feature weight setting

In the proposed co-segmentation method, the multiple searching strategy makes up for the loss of foreground integrity suffered by traditional single searching, and the adaptive feature weight setting mechanism provides more flexibility and relieves the burden on users. In this section, we give several contrast experiments to further demonstrate the effectiveness of these two strategies.

In the first contrast experiment, we replace the multiple searching module with traditional single searching and re-evaluate the performance on iCoseg and Coseg-INCT with this change. As presented in Table V, when adopting the single searching strategy, the error rates for iCoseg-Single and iCoseg-Multi increase from 7.75 to 9.06 and from 7.80 to 9.21, respectively. Considering that many missed common targets are only small local regions, this is already a considerable degradation in the error rate measurement. For Coseg-INCT, the error rate rises from 7.25 to 9.01. In terms of the number of extracted targets, the advantage of our multiple search method is much more significant: the total number of extracted targets for iCoseg-Multi and Coseg-INCT increases from 292 to 381 and from 387 to 545, respectively.

Besides, for the multiple searching strategy, the threshold T is an important parameter. When T is too large, although more targets can be extracted, wrong labels may also be given to non-target regions; for a too small threshold, the cost of a low false detection rate is that more targets may be missed. In our implementation, an adaptive threshold related to the similarity variance is used. Another straightforward design for T is to use a fixed portion (indicated by θ) of the total number of images. We test different threshold settings on iCoseg and Coseg-INCT, and the corresponding results in terms of error rate and number of extracted targets are reported in Table V. As can be noticed, when we change θ from 0.4 to 0.1, the total segmentation error rates for iCoseg-Multi and Coseg-INCT decrease from 12.43 to 7.92 and from 11.11 to 7.32, respectively. However, as analysed above, the number of extracted targets also decreases as T becomes smaller. Our adaptive threshold pulls the algorithm out of this dilemma to some extent: as presented in Table V, it achieves the lowest error rates while the target extraction ability remains relatively strong. Besides, as can be seen from the results for iCoseg-Single, our adaptive threshold also achieves an obvious improvement on single-target images compared with the fixed thresholds.


TABLE VI
Contrast experiments with different feature combinations on iCoseg and Coseg-INCT. Error rate (%) and the number of extracted targets (extracted/all) are considered; the latter is only available for iCoseg-Multi and Coseg-INCT.

Dataset         Adaptive Feature   α = 0            α = 0.2     α = 0.5     α = 0.8     α = 1
iCoseg-Single   7.75               9.10             9.00        8.01        8.80        9.62
iCoseg-Multi    7.80 (381/441)     8.75 (367/441)   8.35/373    8.30/378    8.27/367    8.56/342
Coseg-INCT      7.25 (545/671)     8.07 (524/671)   8.03/506    8.04/498    8.70/475    9.85/422

In the next experiment, we explore the influence of feature selection. In this contrast experiment, we give α five fixed values (0, 0.2, 0.5, 0.8 and 1) and use them as control groups; the multiple searching module remains enabled for these control groups. As can be noticed from Table VI, when α increases from 0 to 0.5, the error rate for iCoseg falls. This is because the targets in the iCoseg groups usually share the same color distributions or are even the same targets, hence the color histogram is effective consistent information. Even so, depending entirely on the color feature is still not a wise choice, because some backgrounds are also very similar to the targets, so the BoF can be a necessary supplement. The performance change from α = 0.5 to α = 1 proves this point: the performance starts to get worse as the weight of color keeps increasing. As for Coseg-INCT, the color distributions of the common targets vary greatly (except for Cormorant, Egret, Firepink and Whitecampion), so in general the color information is relatively less important than the BoF, and the performance generally decreases with the increase of α. However, lower importance does not mean that color is entirely harmful for similarity measurement. Actually, the descriptive ability of the BoF is still imperfect, since the inner-class variation is very significant, so color can be a necessary assistance in ruling out some target-like background regions and recognizing the real targets. Returning to the performance of the adaptive feature, its advantage is obvious: on both iCoseg and Coseg-INCT, the adaptive feature achieves the lowest error rates and extracts the most common targets.

VI. Discussion

Since our method is based on proposal selection, the quality of the generated proposals directly impacts the final segmentation. If only a few proposals can cover all the implied common targets, the efficiency of the proposed method improves. Thanks to the proposal ranking system of [21], in our experiments we selected the top 10 proposals for each image, which is satisfactory. To further test the influence of the number of proposal candidates on performance, we test our method with K = 10 and K = 20 on the three datasets. The quantitative evaluation is presented in Table VII.


TABLE VII
Contrast experiments on different numbers of proposal candidates. Segmentation error rate (%) and the number of extracted objects (only for groups with multiple targets) are considered.

Setting     iCoseg       MSRC         Coseg-INCT
K = 10      7.77/381     13.49/166    7.25/545
K = 20      7.89/384     13.26/166    7.69/536

It can be noticed that on all three datasets there is only a slight difference between the results for the two values of K. This supports the conclusion that most common targets are already contained in the top 10 proposals, and that our multiple searching strategy usually avoids useless searching thanks to its adaptive terminal condition. Considering the extra computational cost of adding more proposals, K = 10 is an ideal choice for our method.

We implemented the proposed method in MATLAB and conducted the experiments on a PC with a 3.0 GHz CPU and 4 GB RAM. The running time is mainly spent on proposal generation and feature extraction. For an image of size 300×200, proposal generation takes about 1 min and BoF feature extraction about 20 s; overall, the whole co-segmentation process for 10 images takes about 15 min.

As can be noticed from our objective function in Eq. 2, only pairwise similarities between object proposals are considered. An intuitive idea is that objectness scores or saliency values could act as unary terms in the formulation. In fact, by absorbing the unary terms into the pairwise ones, the new objective function takes exactly the same form as Eq. 2, so it remains tractable in our framework (a small sketch is given at the end of this section). In this work we did not introduce unary terms, for two reasons. First, co-segmentation focuses on extracting the common targets by mining shared characteristics, whereas objectness/saliency focuses on individual images, so adopting only pairwise terms better highlights the effectiveness of the proposed framework for co-segmentation. Second, for image groups with a distinctive contrast between foreground and background, unary terms can improve performance, but a poor objectness/saliency estimate may make the results even worse. How to introduce unary terms effectively will be a worthwhile topic for our future work.

In this paper, common targets are mainly defined as objects of the same category. However, in the second image group of Fig. 6, the football players in red clothes are more likely to be viewed as the real common targets, since they appear in every single image and are highly correlated with each other. Accordingly, our feature selecting strategy assigns the color feature a relatively large weight, which in turn makes the players in non-red clothes behave more like noise during the multiple searching steps. This is acceptable and consistent with the subjective judgement of observers, because those players are less related to the common targets and appear in only a few images. The same is true for the women players in Fig. 1 and Fig. 2.
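As promised above, here is a small sketch of the unary-term absorption. The notation is ours, not the paper's exact one: assume Eq. 2 maximizes the sum of pairwise similarities w(p_i, p_j) over the n selected proposals p_1, ..., p_n (one per image), and let u(p_i) be an objectness or saliency score. Then

\[ \sum_{i=1}^{n} u(p_i) + \sum_{i<j} w(p_i, p_j) \;=\; \sum_{i<j} \Big[ w(p_i, p_j) + \frac{u(p_i) + u(p_j)}{n-1} \Big], \]

since each proposal appears in exactly n-1 pairs. The right-hand side is again a sum of purely pairwise terms, so the same optimization machinery, e.g. loopy belief propagation, applies unchanged.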


Our framework is still open to multi-class problems. A feasible solution could be to cluster the obtained proposals beforehand and then search for the common targets of each class separately. The extensibility of the proposed method also lies in integrating more features into the feature selecting step. Theoretically, a straightforward extension to several similarity metrics is to tune the weight of each metric separately and iteratively: in each weight updating loop, only one weight coefficient is variable, while the weights of the other features keep a fixed ratio according to their values from the last loop. By optimizing the same objective function as in Eq. 7, we could gradually approach the appropriate weight coefficients; a minimal sketch of this alternating update is given below. These are still theoretical extension strategies, and we look forward to exploring them in future work.
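The following Python sketch illustrates the alternating update just described. It assumes a black-box consistency_score(w) that evaluates a candidate weight vector with the objective of Eq. 7; the grid, loop count and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def tune_feature_weights(consistency_score, m, n_loops=5,
                         grid=np.linspace(0.0, 0.9, 10)):
    """Alternating per-feature weight tuning (illustrative sketch).

    consistency_score(w) evaluates a weight vector w (entries sum to 1)
    with the group-consistency objective of Eq. 7; m is the number of
    feature similarity metrics being combined.
    """
    w = np.full(m, 1.0 / m)                  # start from uniform weights
    for _ in range(n_loops):
        for k in range(m):                   # only w[k] varies in this step
            others = np.delete(w, k)
            ratios = others / others.sum()   # other weights keep a fixed ratio
            best, best_score = w, consistency_score(w)
            for wk in grid:
                # redistribute the remaining mass 1 - wk over the others
                cand = np.insert(ratios * (1.0 - wk), k, wk)
                score = consistency_score(cand)
                if score > best_score:
                    best, best_score = cand, score
            w = best
    return w
```

With only two features this degenerates to a line search over the single coefficient α used above.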


VII. Conclusion

This paper introduces an unsupervised co-segmentation approach for images containing an indefinite number of common targets, which extends the existing proposal selection based methods by introducing an adaptive multi-search strategy. To increase the accuracy and flexibility of the similarity measurements, we introduce an adaptive feature weight selecting algorithm that is quite simple and effective. Extensive experiments on iCoseg, MSRC and Coseg-INCT have demonstrated its good performance.

Acknowledgment

The research has been supported by the National Natural Science Foundation of China under Grant 61371140.

References

[1] C. Rother, T. Minka, A. Blake, and V. Kolmogorov, "Cosegmentation of image pairs by histogram matching - incorporating a global constraint into MRFs," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, New York, NY, USA, Jun. 2006, pp. 993-1000.
[2] A. Joulin, F. Bach, and J. Ponce, "Discriminative clustering for image co-segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Francisco, CA, USA, Jun. 2010, pp. 1943-1950.
[3] A. Joulin, F. Bach, and J. Ponce, "Multi-class cosegmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Providence, RI, USA, Jun. 2012, pp. 542-549.
[4] G. Kim, E. Xing, L. Fei-Fei, and T. Kanade, "Distributed cosegmentation via submodular optimization on anisotropic diffusion," in Proc. Int. Conf. Computer Vision, Barcelona, Spain, Nov. 2011, pp. 169-176.
[5] S. Vicente, C. Rother, and V. Kolmogorov, "Object cosegmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Providence, RI, USA, Jun. 2011, pp. 2217-2224.
[6] F. Meng, H. Li, G. Liu, and K. N. Ngan, "Object co-segmentation based on shortest path algorithm and saliency model," IEEE Transactions on Multimedia, vol. 14, no. 5, pp. 1429-1441, 2012.
[7] F. Meng, H. Li, K. N. Ngan, L. Zeng, and Q. Wu, "Feature adaptive co-segmentation by complexity awareness," IEEE Transactions on Image Processing, vol. 22, no. 12, pp. 4809-4824, 2013.
[8] L. Mukherjee, V. Singh, and C. Dyer, "Half-integrality based algorithms for cosegmentation of images," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Miami, FL, USA, Jun. 2009, pp. 2028-2035.
[9] D. Hochbaum and V. Singh, "An efficient algorithm for co-segmentation," in Proc. Int. Conf. Computer Vision, Kyoto, Japan, Sep. 2009, pp. 269-276.
[10] S. Vicente, V. Kolmogorov, and C. Rother, "Cosegmentation revisited: Models and optimization," in Proc. Eur. Conf. Computer Vision, Crete, Greece, Sep. 2010, vol. 6312, pp. 465-479.
[11] L. Mukherjee, V. Singh, and J. Peng, "Scale invariant cosegmentation for image groups," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Providence, RI, USA, Jun. 2011, pp. 1881-1888.
[12] D. Lowe, "Object recognition from local scale-invariant features," in Proc. Int. Conf. Computer Vision, vol. 2, 1999, pp. 1150-1157.
[13] L. Mukherjee, V. Singh, J. Xu, and M. Collins, "Analyzing the subspace structure of related images: Concurrent segmentation of image sets," in Proc. Eur. Conf. Computer Vision, vol. 7575, 2012, pp. 128-142.
[14] G. Kim and E. Xing, "On multiple foreground cosegmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Providence, RI, USA, Jun. 2012, pp. 837-844.
[15] H. Li, F. Meng, Q. Wu, and B. Luo, "Unsupervised multiclass region cosegmentation via ensemble clustering and energy minimization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 5, pp. 789-801, 2014.
[16] F. Wang, Q. Huang, M. Ovsjanikov, and L. Guibas, "Unsupervised multi-class joint image segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Jun. 2014, pp. 3142-3149.
[17] Y. Chai, V. Lempitsky, and A. Zisserman, "BiCoS: A bi-level co-segmentation method for image classification," in Proc. Int. Conf. Computer Vision, Nov. 2011, pp. 2579-2586.
[18] E. Kim, H. Li, and X. Huang, "A hierarchical image clustering cosegmentation framework," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Providence, RI, USA, Jun. 2012, pp. 686-693.
[19] J. Rubio, J. Serrat, A. Lopez, and N. Paragios, "Unsupervised co-segmentation through region matching," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Providence, RI, USA, Jun. 2012, pp. 749-756.
[20] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, "iCoseg: Interactive co-segmentation with intelligent scribble guidance," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Francisco, CA, USA, Jun. 2010, pp. 3169-3176.
[21] I. Endres and D. Hoiem, "Category-independent object proposals with diverse ranking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 2, pp. 222-234, Feb. 2014.
[22] K.-Y. Chang, T.-L. Liu, and S.-H. Lai, "From co-saliency to co-segmentation: An efficient and fully unsupervised energy minimization model," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Providence, RI, USA, Jun. 2011, pp. 2129-2136.
[23] M. Collins, J. Xu, L. Grady, and V. Singh, "Random walks based multi-image segmentation: Quasiconvexity results and GPU-based solutions," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Jun. 2012, pp. 1656-1663.
[24] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu, "Unsupervised joint object discovery and segmentation in internet images," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Jun. 2013, pp. 1939-1946.
[25] J. Dai, Y. N. Wu, J. Zhou, and S.-C. Zhu, "Cosegmentation and cosketch by unsupervised learning," in Proc. Int. Conf. Computer Vision, 2013.
[26] W. Tao, K. Li, and K. Sun, "SaCoseg: Object cosegmentation by shape conformability," IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 943-955, Mar. 2015.
[27] H. Fu, D. Xu, S. Lin, and J. Liu, "Object-based RGBD image co-segmentation with mutex constraint," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015.
[28] V. Badrinarayanan, I. Budvytis, and R. Cipolla, "Semi-supervised video segmentation using tree structured graphical models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2751-2764, Nov. 2013.
[29] Z. Lou and T. Gevers, "Extracting primary objects by video co-segmentation," IEEE Transactions on Multimedia, vol. 16, no. 8, pp. 2110-2117, Dec. 2014.
[30] D. Zhang, O. Javed, and M. Shah, "Video object co-segmentation by regulated maximum weight cliques," in Proc. Eur. Conf. Computer Vision, vol. 8695, 2014, pp. 551-566.
[31] J. Carreira and C. Sminchisescu, "CPMC: Automatic object segmentation using constrained parametric min-cuts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1312-1328, 2012.
[32] H. Fu, D. Xu, B. Zhang, and S. Lin, "Object-based multiple foreground video co-segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Jun. 2014, pp. 3166-3173.


[33] B. Andres, J. H. Kappes, U. Kothe, C. Schnorr, and F. A. Hamprecht, "An empirical comparison of inference algorithms for graphical models with higher order factors using OpenGM," in Proc. DAGM Symposium, 2010.
[34] M. Bergtholdt, J. Kappes, and S. Schmidt, "A study of parts-based object class detection using complete graphs," International Journal of Computer Vision, vol. 87, no. 1-2, pp. 93-117, 2010.
[35] J. Shotton, J. Winn, C. Rother, and A. Criminisi, "TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation," in Proc. Eur. Conf. Computer Vision, 2006, vol. 3951, pp. 1-15.
[36] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, Jun. 2005, pp. 524-531.

Kunqian Li received the BS degree in Electrical and Information Engineering from China University of Petroleum (UPC), Qingdao, China, in 2012. He is currently working toward the PhD degree under the supervision of Prof. Tao in the National Key Laboratory of Science & Technology on Multi-spectral Information Processing, School of Automation, Huazhong University of Science and Technology (HUST), Wuhan, China. His research interests include image segmentation and object recognition.

Jiaojiao Zhang received the BS degree in Electrical and Information Engineering from China University of Geosciences (CUG), Wuhan, China, in 2013. She is currently a third-year Master's candidate in the National Key Laboratory of Science & Technology on Multi-spectral Information Processing, School of Automation, Huazhong University of Science and Technology (HUST), Wuhan, China. Her research interests include image and video segmentation.

Wenbing Tao received the PhD degree in pattern recognition and intelligent systems from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2004. He is now a full professor in the School of Automation, HUST. He was a research fellow in the Division of Mathematical Sciences, Nanyang Technological University, from Mar. 2008 to Mar. 2009. Dr. Tao's research interests lie in the areas of computer vision, image segmentation, object recognition and tracking. He has published numerous journal and conference papers on image processing and object recognition. He serves as a reviewer for many journals, such as the International Journal of Computer Vision, IEEE Transactions on Image Processing, and Pattern Recognition.
