

Multilabel Image Classification via High-Order Label Correlation Driven Active Learning

Bang Zhang, Member, IEEE, Yang Wang, Senior Member, IEEE, and Fang Chen, Senior Member, IEEE

Abstract—Supervised machine learning techniques have been applied to multilabel image classification problems with tremendous success. Despite disparate learning mechanisms, their performances rely heavily on the quality of training images. However, the acquisition of training images requires significant effort from human annotators, which hinders the application of supervised learning techniques to large scale problems. In this paper, we propose a high-order label correlation driven active learning (HoAL) approach that allows the iterative learning algorithm itself to select the informative example-label pairs from which it learns, so as to obtain an accurate classifier with less annotation effort. Four crucial issues are considered by the proposed HoAL: 1) unlike binary cases, the selection granularity for multilabel active learning needs to be refined from example to example-label pair; 2) different labels are seldom independent, and label correlations provide critical information for efficient learning; 3) in addition to pair-wise label correlations, high-order label correlations are also informative for multilabel active learning; and 4) since the number of label combinations increases exponentially with respect to the number of labels, an efficient mining method is required to discover informative label correlations. The proposed approach is tested on public data sets, and the empirical results demonstrate its effectiveness.

Index Terms—Active learning, high-order label correlation, multilabel classification.
I. INTRODUCTION

ONE of the principal difficulties in applying supervised learning techniques to image classification problems is the large amount of labeled training images required. In many cases, unlabeled images are easy to obtain, while annotation is expensive or time consuming. This necessitates active learning [1], [2], which allows the learning algorithm to actively select the images from which it learns. Its key idea is to find the images whose annotation would most improve the current classifier's performance, thereby reducing the annotation cost. Active learning is performed in an iterative fashion.

Manuscript received April 7, 2013; revised September 20, 2013; accepted January 6, 2014. Date of publication January 27, 2014; date of current version February 18, 2014. This work was supported in part by the Australian Government through the Department of Broadband, Communications and the Digital Economy and in part by the Australian Research Council through the ICT Centre of Excellence Program, National ICT Australia. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Nikolaos V. Boulgouris. The authors are with National ICT Australia, Sydney, NSW 2015, Australia (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2014.2302675

Fig. 1. Internet images with multi-label characteristic. Pair-wise and higher order label correlations are manifest, and crucial to efficient image classification.

Taking traditional binary myopic active learning as an example, each learning iteration selects the example with the highest informativeness score for annotation, and the classifier is retrained on the training dataset with the newly labeled example included. The learning process continues until the annotation resources are depleted or the obtained classifier is as accurate as desired. One problem of binary myopic active learning is that at each iteration only one example is selected for annotation, so the classifier has to be retrained whenever a single example is annotated. The whole learning process is inefficient and impractical, especially when retraining is computationally expensive and parallel annotation systems, such as Mechanical Turk1 [3] and LabelMe2 [4], are available. In order to overcome this drawback, batch mode active learning has recently attracted increasing attention. It aims to select a batch of informative examples, instead of a single example, for annotation at each learning iteration. The key difficulty comes from the potential information overlap within the selected examples at each iteration; namely, the selected examples need to be not only informative but also diverse. Several batch mode active learning approaches have been proposed recently, but most of them focus on binary classification problems in which an example is associated with only one label. In contrast, real-world applications, such as semantic Internet image classification and retrieval, usually exhibit the multi-label characteristic. Fig. 1 shows typical Internet images with a beach theme. In this paper, we tackle image classification as a multi-label batch mode active learning (MLBAL) problem.

1 https://www.mturk.com
2 http://labelme.csail.mit.edu/



Fig. 2. Framework for applying the proposed HoAL to real-world scale semantic image classification and retrieval, e.g., Internet image search engine.

With the multi-label and batch mode factors considered, the proposed HoAL makes active learning applicable to real-world scale image classification and retrieval, e.g., an Internet image search engine. As Fig. 2 shows, images are crawled and sent to the engine (step 1), and the active learning algorithm (HoAL) selects the most informative example-label pairs for human experts to annotate (steps 2, 3, 4, 5, and 6). HoAL then trains a classifier on the labeled training data (step 7). If the learned classifier can accurately classify images, the active learning stops; otherwise, another learning iteration is conducted (steps 2-7). Such iterative learning continues until the obtained classifier meets the desired criteria. It is worth noting that parallel annotation systems, e.g., Mechanical Turk and LabelMe, can be seamlessly integrated into the framework as annotators.
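For concreteness, the iterative workflow of Fig. 2 can be sketched as a simple loop. This is a minimal illustration only; the callables `train`, `select_pairs`, and `annotate` are placeholders standing in for the base classifiers, the HoAL selection rule described in Section III, and the human (or crowdsourced) annotators, and are not part of the paper's own implementation.

```python
def hoal_loop(labeled, unlabeled, batch_size, n_iters, train, select_pairs, annotate):
    """Sketch of the iterative workflow in Fig. 2 (steps 2-7).

    labeled / unlabeled: sets of (example_id, label_index) pairs.
    train, select_pairs, annotate: placeholder callables for the base
    classifiers, the selection rule, and the annotators, respectively.
    """
    models = train(labeled, unlabeled)                                # initial prediction models
    for _ in range(n_iters):                                          # budget / accuracy criterion
        batch = select_pairs(models, labeled, unlabeled, batch_size)  # steps 2-4
        annotate(batch)                                               # steps 5-6: obtain true labels
        labeled, unlabeled = labeled | batch, unlabeled - batch       # update the pair pools
        models = train(labeled, unlabeled)                            # step 7: retrain classifiers
    return models
```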


There are several important issues that need to be addressed for MLBAL. Firstly, the example-label pair, instead of the example, becomes the annotation selection granularity. In MLBAL, every example has multiple labels, so the active learner has to select not only examples but also their labels for annotation. For a given example, the contributions of its labels to improving the classifier's performance differ because of the inherent label correlations. So in order to achieve efficient active learning, the selection granularity needs to be refined from the example level to the example-label pair level. In other words, the active selection for MLBAL is conducted along both the example and label dimensions.

Secondly, labels are usually dependent, and their inherent correlations are useful for inferring unknown labels from known labels. Fig. 1 provides examples of label correlations. Binary batch mode active learning methods only need to select examples for annotation, and thus only exploit the information redundancy along the example dimension. Such redundancy is caused by the similarities among different examples. For MLBAL, however, information redundancy also occurs along the label dimension, where it is caused by the inherent label correlations. Therefore, the intuition is that in addition to example similarities, label correlations can also be exploited to further improve the performance of active learning.

Thirdly, although several algorithms have been proposed to consider label correlations for multi-label classification, most of them only exploit low order label correlations (e.g., pair-wise label correlations) due to the computational complexity. The search effort increases exponentially when one more order is considered for searching informative label correlations. Nevertheless, high order label correlations often reveal vital information for efficient active learning, so an efficient method is needed for discovering useful high order label correlations.

Fourthly, finding informative label correlations is not trivial, especially for higher order label correlations. Uninformative label correlations could deteriorate learning performance. Therefore, a proper measurement of the informativeness of label correlations is important, and the discovery process for informative correlations should be efficient, especially when the size of the dataset is huge.

The proposed method is developed with all these issues in mind: (1) A score function is defined to measure the informativeness of example-label pairs. It is designed based on both likelihood maximization (on labeled data) and uncertainty minimization (on unlabeled data). (2) In order to take advantage of informative label correlations, we define the cross-label uncertainty, which gauges the disagreement between a mined label correlation and the label co-occurrence probability implied by the learned classification model. Kullback-Leibler (KL) divergence [5] is utilized to measure this cross-label uncertainty. (3) The proposed method considers not only pair-wise but also higher order label correlations. An auxiliary compositional label is defined as a combination of primary labels in order to utilize informative high order correlations. (4) For informative label correlation discovery, an efficient data mining method called association rule mining [6] is adopted. An informative correlation can be found from an association rule, and its informativeness is measured by the support and confidence of the rule.

The rest of this paper is organized as follows: Related work is reviewed in Section II. Section III presents the proposed method. Experiments are conducted in Section IV. Discussion and conclusions are given in Section V.

II. RELATED WORK

Active learning has been extensively studied for a number of years, and researchers have addressed it in a variety of ways, including methods based on uncertainty sampling [7], the version space of SVMs [8], disagreement among classifiers [9], and expected informativeness [10]. A comprehensive survey can be found in [2]. A large portion of the existing active learning techniques are designed for myopic active learning: only one example is selected for annotation at each learning iteration, and the classifier is updated every time a new annotated



example becomes available. Such a setting hinders the adoption of parallel annotation systems and incurs heavy computational cost for classifier updates, thereby preventing active learning from being utilized in large scale real world applications, such as automatic Internet image annotation. Therefore, in order to overcome the drawbacks of myopic active learning, batch mode active learning has been proposed and has attracted increasing attention [11]-[21]. It aims to select a batch of informative examples for annotation at each learning iteration, rather than one example at a time. A naive way to select a batch of informative unlabeled examples is to choose the most informative k examples. However, the problem with this strategy is that some selected examples might be similar, or even identical, which incurs information overlap. So the key difficulty in achieving informative batch selection is to reduce the information redundancy among the selected examples. Based on such concerns, [11] performs batch mode selection by considering both the diversity and informativeness of the selected examples in an SVM framework. [19] also takes into account the diversity of the selected examples by querying cluster centroids that are close to the decision boundary. [20], [21] incorporate Fisher information into batch mode active learning for binary logistic regression. [12], [13] tackle batch mode active learning under the semi-supervised learning setting. [12], [14] solve the problem in a discriminative fashion: they select examples that directly maximize the objective function or maximize the reduction of the cost function. [17] formulates the selection task as a matrix partition problem and tries to maximize a mutual information criterion between the labeled and unlabeled examples. Overall, all these methods are designed for binary classification problems and do not consider label correlations.

For multi-label active learning, [22] utilizes mutual information to measure the correlation between labels to achieve efficient learning, and [23] extends the Value of Information framework to take label correlations into account. But both of them are designed for myopic active learning. In this paper, we consider a more challenging problem, namely multi-label batch mode active learning.

For multi-label classification problems, most of the existing methods adopt one-vs-one or one-vs-all strategies to convert the original problem into a set of binary problems. Other approaches [24]-[26] adopt a ranking-based learning strategy which calculates real-valued scores for different labels and classifies examples by selecting all the labels with scores larger than a predefined threshold. Although all these strategies can tackle multi-label classification problems, they generally do not consider label correlations. There exist several methods [22], [27]-[33] that take label correlations into account for multi-label classification problems. However, due to the computational complexity, they usually only consider pair-wise label correlations.

III. MULTI-LABEL BATCH MODE ACTIVE LEARNING

In this section, we present the proposed HoAL for multi-label batch mode active learning. A formal definition of the problem is first given in Section III-A. Section III-B describes the score function for example-label pairs. The mining process for discovering informative label correlations is presented in Section III-C. Optimization techniques for maximizing the score function and procedures for parameter tuning are given in Sections III-D and III-E, respectively.
A. Problem Formulation

Formally, we use $X$ to denote the feature space of examples, and we assume there is a label set containing $K$ different labels. The labels associated with an example $x \in X$ then form a subset of this label set, which can be represented as a $K$-dimensional binary vector $Y = \{y_1, \ldots, y_K\}$, with 1 indicating that the example belongs to the corresponding concept and $-1$ otherwise. As the active learning process proceeds, some of the labels of an example $x$ are annotated while others are not. The currently annotated labels of $x$ form a labeled example-label pair set $LP(x) = \{(x, y_i) \mid y_i \text{ is labeled}\}$, and the remaining labels form an unlabeled example-label pair set $UP(x) = \{(x, y_i) \mid y_i \text{ is unlabeled}\}$.

Initially, the active learning algorithm is provided with a small number of annotated example-label pairs, $L^0 = \{LP(x_1), \ldots, LP(x_N)\}$, and a large number of unlabeled example-label pairs, $U^0 = \{UP(x_1), \ldots, UP(x_N)\}$. $N$ is the size of the example pool containing all the labeled, unlabeled, and partially labeled examples. Initial prediction models $P(y_j | x, w_j^0)$, $1 \leq j \leq K$, can be obtained based on $L^0$ and $U^0$, where $w_j^0$ is the model parameter vector. At each learning iteration $t$, a batch of $m$ unlabeled example-label pairs $S^t \subseteq U^{t-1}$ is selected for annotation, where $m$ is the predefined batch selection size. The example-label pair sets are then updated as $U^t = U^{t-1} - S^t$ and $L^t = L^{t-1} \cup S^t$. Subsequently, the updated prediction models $P(y_j | x, w_j^t)$ are obtained on $L^t$ and $U^t$. This process repeats until the stop criterion is reached. The goal is to search for the optimal selection $S^t$ which leads to the best prediction models $P(y_j | x, w_j^t)$ at each learning iteration. It is worth noting that the example-label pair sets $L$, $U$, and $S$ automatically reduce to example sets when the target problem degrades from a multi-label problem to a binary problem. For notational convenience, we keep the same symbols for binary cases.
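As a concrete illustration of this bookkeeping (a minimal sketch only; the set-of-tuples layout and identifiers are our own, not prescribed by the paper), the labeled and unlabeled pools and the per-iteration update $L^t = L^{t-1} \cup S^t$, $U^t = U^{t-1} - S^t$ can be kept as plain sets of (example, label index) pairs:

```python
# Illustrative bookkeeping for example-label pairs (K = 5 labels, 3 examples).
K = 5
examples = {0, 1, 2}

# LP(x) / UP(x) pooled over all examples as sets of (example_id, label_index) pairs.
L = {(0, 0), (0, 3), (1, 2)}                              # already annotated pairs
U = {(x, j) for x in examples for j in range(K)} - L      # everything not yet annotated

def update_pools(L, U, S):
    """One iteration: L^t = L^{t-1} ∪ S, U^t = U^{t-1} − S for a selected batch S ⊆ U."""
    assert S <= U
    return L | S, U - S

S = {(0, 1), (2, 4)}                                      # a batch of m = 2 selected pairs
L, U = update_pools(L, U, S)
print(len(L), len(U))                                     # 5 labeled pairs, 10 unlabeled pairs
```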


B. Informative Example-Label Pairs Selection

In classification settings, every datum is supposed to be associated with labels. But in reality, most data are unlabeled, and label acquisition requires significant human effort. In supervised learning, within the available data repository, only part of the data is labeled and utilized for training; the unlabeled data are ignored. A classifier is learned by maximizing the likelihood of the labeled examples. In semi-supervised learning, unlabeled data are also considered for training. A classifier is learned by simultaneously maximizing the likelihood of the labeled data and minimizing the label uncertainty of the unlabeled data. Its objective function can be represented as

$$\sum_{i \in L} L(x_i, y_i, w) - \alpha \sum_{j \in U} UC(x_j, w). \tag{1}$$

$\alpha$ is a trade-off parameter for adjusting the relative influence of the labeled and unlabeled data. $L$ and $UC$ denote the likelihood and uncertainty functions, respectively.

Active learning goes one step further. It actively selects the most informative unlabeled examples for labeling at each learning iteration, so as to gradually learn more accurate classifiers with less annotation effort. With the selected examples annotated and included in the training dataset, the classifier is updated. Among all the possible selections, the optimal selection leads to an updated classifier which maximizes the likelihood of the labeled data and minimizes the uncertainty of the unlabeled data, as in semi-supervised learning. Therefore, a score function measuring the informativeness of the selected examples can be defined as

$$f(S) = \sum_{i \in L^{t-1} \cup S} L(x_i, y_i, w^t) - \alpha \sum_{j \in U^{t-1} - S} UC(x_j, w^t), \tag{2}$$

where $w^t$ is the parameter vector learned on the updated dataset $L^{t-1} \cup S$, and the optimal selection $S^*$ is the selection with the highest score. Following a similar consideration, [12] defines a score function for binary batch mode active learning by considering the log likelihood on labeled data and adopting entropy as the uncertainty measure on unlabeled data:

$$f(S) = \sum_{i \in L^{t-1} \cup S} \log P(y|x_i, w^t) - \alpha \sum_{j \in U^{t-1} - S} H(y|x_j, w^t). \tag{3}$$

This score function can be naively extended to MLBAL to measure the informativeness of example-label pairs:

$$f(S) = \sum_{(x_i, y_r) \in L^{t-1} \cup S} \log P(y_r|x_i, w_r^t) - \alpha \sum_{(x_j, y_s) \in U^{t-1} - S} H(y_s|x_j, w_s^t), \tag{4}$$

where

$$H(y_s|x_j, w_s^t) = -\sum_{y_s = \pm 1} P(y_s|x_j, w_s^t) \log P(y_s|x_j, w_s^t) \tag{5}$$

measures the entropy of the unlabeled example-label pair $(x_j, y_s)$, and $w_s^t$ denotes the model parameter vector obtained at iteration $t$ for label $s$.

However, such an extension only considers the uncertainty of a single label, namely the single-label uncertainty. In addition, we consider the cross-label uncertainty, which comes from the disagreement between the observed label correlations and the learned label predictions. For instance, in Fig. 1, we observe that the image labels "beach" and "ocean" co-occur frequently, which indicates that the two labels are highly correlated. Uncertainty between the two labels over an example image $x$ then appears if the predicted probabilities $P(y_{beach}|x)$ and $P(y_{ocean}|x)$ conflict with each other. In order to utilize it, we propose to adopt the Kullback-Leibler divergence [5] to measure the cross-label uncertainty.

The essential idea is that, for a specific label, besides the prediction model learned from its training examples, its observed correlations with other labels also provide predictions for it. Considering the previous example, since the two labels are highly correlated, the prediction for the label "ocean" can be regarded as a prediction for the label "beach" as well. The prediction disagreement from the label "ocean" to the label "beach" can be measured by the KL divergence $D_{KL}(P(y_{beach}|x) \| P(y_{ocean}|x))$. Thus, the cross-label uncertainty of the unlabeled example-label pair $(x, y_{beach})$ can be measured by the sum of the KL divergences from all its correlated labels. If we use $c(y_s)$ and $C_{y_s} = |c(y_s)|$ to denote all the correlated labels of $y_s$ and the number of correlated labels, respectively, the score function of a selection can be redefined by taking cross-label uncertainty into account:

$$f(S) = \sum_{(x_i, y_r) \in L^{t-1} \cup S} \log P(y_r|x_i, w_r^t) - \alpha \sum_{(x_j, y_s) \in U^{t-1} - S} \Big[ H(y_s|x_j, w_s^t) + \frac{1}{C_{y_s}} \sum_{y_t \in c(y_s)} D_{KL}(P_{y_s} \| P_{y_t}) \Big] \tag{6}$$

$$= \sum_{(x_i, y_r) \in L^{t-1} \cup S} \log P(y_r|x_i, w_r^t) - \alpha \sum_{(x_j, y_s) \in U^{t-1} - S} \frac{1}{C_{y_s}} \sum_{y_t \in c(y_s)} H(P_{y_s}, P_{y_t}), \tag{7}$$

where

$$D_{KL}(P_{y_s} \| P_{y_t}) = \sum_{y_s = \pm 1} P(y_s|x_j, w_s^t) \log \frac{P(y_s|x_j, w_s^t)}{P(y_t|x_j, w_t^t)}, \tag{8}$$

$$H(P_{y_s}, P_{y_t}) = -\sum_{y_s = \pm 1} P(y_s|x_j, w_s^t) \log P(y_t|x_j, w_t^t). \tag{9}$$

Recall that the KL divergence $D_{KL}(P_{y_s} \| P_{y_t})$ is an asymmetric measure of the difference between two probability distributions $P_{y_s}$ and $P_{y_t}$; it increases with the discrepancy of $P_{y_s}$ from $P_{y_t}$. The cross entropy $H(P_{y_s}, P_{y_t}) = H(P_{y_s}) + D_{KL}(P_{y_s} \| P_{y_t})$ measures the average coding length of a variable generated by the distribution $P_{y_s}$ under a coding scheme based on another distribution $P_{y_t}$. It captures both the uncertainty of $P_{y_s}$ and the inconsistency between $P_{y_s}$ and $P_{y_t}$, and is well known via the minimum cross-entropy principle [34]. Both the KL divergence and the cross entropy have been widely used in pattern recognition and machine learning [35], [36], and specifically in active learning [23], [37]-[39].
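To make the uncertainty terms concrete, the sketch below evaluates Eqs. 5, 8, and 9 for Bernoulli label predictions and combines them into the bracketed per-pair uncertainty of Eq. 6. This is an illustration only: the probabilities are made-up numbers, the function names are ours, and the predictions are assumed to lie strictly between 0 and 1.

```python
import math

def entropy(p):
    """Eq. 5: binary entropy of a prediction P(y_s = +1 | x_j) = p."""
    return -sum(pi * math.log(pi) for pi in (p, 1.0 - p) if pi > 0)

def kl(p_s, p_t):
    """Eq. 8: D_KL(P_{y_s} || P_{y_t}) for two Bernoulli predictions in (0, 1)."""
    return sum(a * math.log(a / b) for a, b in zip((p_s, 1 - p_s), (p_t, 1 - p_t)) if a > 0)

def cross_entropy(p_s, p_t):
    """Eq. 9: H(P_{y_s}, P_{y_t}) = H(P_{y_s}) + D_KL(P_{y_s} || P_{y_t})."""
    return -sum(a * math.log(b) for a, b in zip((p_s, 1 - p_s), (p_t, 1 - p_t)))

def pair_uncertainty(p_s, correlated_probs):
    """Bracketed term of Eq. 6 for one unlabeled pair (x_j, y_s): single-label
    entropy plus the normalized sum of KL divergences to the predictions of the
    correlated labels in c(y_s)."""
    C = max(len(correlated_probs), 1)
    return entropy(p_s) + sum(kl(p_s, p_t) for p_t in correlated_probs) / C

# Example: P(beach|x) = 0.9 while the correlated label gives P(ocean|x) = 0.3;
# the large KL term signals the cross-label disagreement discussed above.
print(pair_uncertainty(0.9, [0.3]))
```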



The first term of Eq. 6, $\sum_{(x_i, y_r) \in L^{t-1} \cup S} \log P(y_r|x_i, w_r^t)$, represents the log likelihood of the annotated example-label pairs, which consist of the example-label pairs annotated at the previous learning iterations, namely $L^{t-1}$, and the example-label pairs newly selected at the current learning iteration, namely $S$. The second term of Eq. 6, $-\alpha \sum_{(x_j, y_s) \in U^{t-1} - S} H(y_s|x_j, w_s^t)$, represents the single-label uncertainty of the unlabeled pairs at the current learning iteration, where $U^{t-1}$ denotes the unlabeled pairs at the previous learning iteration. The third term of Eq. 6, $-\alpha \sum_{(x_j, y_s) \in U^{t-1} - S} \frac{1}{C_{y_s}} \sum_{y_t \in c(y_s)} D_{KL}(P_{y_s} \| P_{y_t})$, indicates the cross-label uncertainty of the unlabeled example-label pairs. For each unlabeled example-label pair $(x_j, y_s)$, there are $C_{y_s}$ correlated labels for label $y_s$. By stating that two labels are correlated, we mean that the two labels co-occur frequently, i.e., they are highly positively correlated. For each of $y_s$'s correlated labels $y_t$, $P_{y_t}$ represents the predicted probability of $x_j$ having label $y_t$, and $P_{y_s}$ represents the predicted probability of $x_j$ having label $y_s$. Thus, $D_{KL}(P_{y_s} \| P_{y_t})$ measures the discrepancy between the predictions for $y_s$ and $y_t$. Since the two labels are highly correlated, such a discrepancy indicates a conflict between the observed label correlation and the learned prediction model; the larger the discrepancy, the more severe the conflict. Such a conflict reflects the cross-label uncertainty between $y_s$ and $y_t$. Therefore, the cross-label uncertainty of the unlabeled example-label pair $(x_j, y_s)$ is measured by the sum of the uncertainties between $y_s$ and each of its correlated labels. Note that a label with more correlations has the potential to bring more uncertainty and subsequently dominate the selection. In order to avoid this, the summation of the KL divergences in Eq. 6 is normalized by $C_{y_s}$. For $w_t^t$ in Eq. 8 and Eq. 9, the superscript $t$ denotes the $t$-th iteration, and the subscript $t$ indicates the $t$-th label. It is worth noting that $c(y_s)$ in Eq. 6 contains at least one correlated label, namely $y_s$ itself, so $C_{y_s} \geq 1$; the cross entropy $H(P_{y_s}, P_{y_t})$ in the last term of Eq. 7 therefore reduces to the entropy $H(P_{y_s})$ when $c(y_s)$ contains only $y_s$ itself. By adopting the score function defined by Eq. 7, single-label uncertainty and cross-label uncertainty are unified and generalized into a multi-label uncertainty measured by the cross entropy of correlated labels.

As mentioned before, some informative label correlations may involve more than two labels, and we call them high order label correlations. Such correlations are crucial to accurate label inference. They help to narrow down the contexts of labels, thereby reducing their semantic ambiguities. For example, the class label "Apple" can indicate either a fruit type or a computer brand. The inference from "Apple" to "Mac" based on their pair-wise correlation is weak and harmful to learning because of the ambiguous semantic meaning of apple. But if we consider a higher order correlation, e.g., the correlation among "Apple," "Computer," and "Mac," the inference becomes precise and helpful, e.g., {"Apple," "Computer"} ⇒ "Mac." "Jaguar" is another similar example, which can represent either a feline or a car brand.

In order to incorporate high order label correlations, we now define an auxiliary compositional label, which is composed of one or several primary labels. Revisiting the previous example, we use $y_s$, $y_u$, and $y_v$ to represent the primary labels "Mac," "Apple," and "Computer," respectively, and assume the correlated labels of $y_s$ are $y_u$ and $\{y_u, y_v\}$. Two compositional labels $Y_{t_1}$ and $Y_{t_2}$ can be defined as

$$Y_{t_1} = \{y_u\}, \quad Y_{t_2} = \{y_u, y_v\}.$$

The compositional label $Y_{t_1}$ is the primary label $y_u$ itself. The compositional label $Y_{t_2}$ is composed of the two primary labels $y_u$ and $y_v$, and it equals 1 only when both $y_u$ and $y_v$ equal 1. The correlated labels of $y_s$ can then be represented by the compositional labels $Y_{t_1}$ and $Y_{t_2}$, namely $c(y_s) = \{Y_{t_1}, Y_{t_2}\}$. We use $c(y_s)$ and $C_{y_s} = |c(y_s)|$ to represent all the correlated compositional labels of $y_s$ and the number of correlated compositional labels, respectively. Now, with the compositional label defined, the score function defined by Eq. 7 can be extended to incorporate high order label correlations:

$$f(S) = \sum_{(x_i, y_r) \in L^{t-1} \cup S} \log P(y_r|x_i, w_r^t) - \alpha \sum_{(x_j, y_s) \in U^{t-1} - S} \frac{1}{C_{y_s}} \sum_{Y_t \in c(y_s)} H(P_{y_s}, P_{Y_t}). \tag{10}$$

The prediction model $P_{Y_t}$ for the compositional label $Y_t$ can be learned by treating the examples with all of its primary labels as positive examples and the rest as negative examples. Since the number of high order informative label correlations is relatively small compared with the number of pair-wise informative label correlations, the number of additional compositional labels is relatively small, and the extra computational cost incurred for computing the prediction models of the additional compositional labels is still affordable.
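As an illustration of how an auxiliary compositional label can be materialized as a binary training target for learning $P_{Y_t}$ (a sketch under our own naming; the paper does not prescribe a particular implementation), an example is treated as positive for $Y_t$ only when all of $Y_t$'s primary labels are positive:

```python
import numpy as np

def compositional_targets(Y, primary_indices):
    """Binary targets (+1/-1) for a compositional label Y_t composed of the
    primary labels in `primary_indices`.

    Y: (n_examples, K) array with entries in {+1, -1}.
    An example is positive for Y_t only if all its primary labels are +1.
    """
    positives = np.all(Y[:, primary_indices] == 1, axis=1)
    return np.where(positives, 1, -1)

# Toy label matrix for the primary labels [Mac, Apple, Computer]
Y = np.array([[ 1,  1,  1],
              [-1,  1, -1],    # "Apple" alone (e.g., the fruit sense)
              [ 1,  1,  1]])
y_t2 = compositional_targets(Y, [1, 2])   # Y_t2 = {Apple, Computer} -> [ 1, -1,  1]
# P_{Y_t2} can then be learned as an ordinary binary prediction model.
```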



C. Mining Informative Label Correlations

In order to apply the defined score function, informative label correlations need to be acquired. Label co-occurrence is widely adopted as a measure to discover label correlation. Labels that co-occur frequently are regarded as highly correlated and useful for label inference. But discovering frequently co-occurring labels is not trivial, especially when we have a large number of labels and the size of the data is huge. As the order of the co-occurrence (how many labels co-occur together) grows, the search space for discovering the frequent co-occurrences increases exponentially. For instance, there are $2^K$ possible label combinations for $K$ different labels. This is partially the reason why most of the existing multi-label classification methods only focus on low order label correlations, especially pair-wise label correlations. In order to efficiently discover informative label correlations of any order, we adopt association rule mining [6], an efficient method originally developed in the data mining area. It has attracted much attention from other research communities since its invention, including information retrieval, machine learning, and computer vision [40]-[42].

Association rule mining was originally invented to efficiently discover interesting relations among different shopping items from the large number of customer shopping records stored in supermarket databases. Formally, an item set $I = \{it_1, \ldots, it_n\}$ denoting different products in supermarkets and a transaction set $T = \{tr_1, \ldots, tr_m\}$ denoting customer shopping records are provided. Each transaction contains a subset of $I$ indicating the products bought in that transaction. A mined rule is an implication of the form $I_A \Rightarrow I_B$, where $I_A$ and $I_B$ are subsets of $I$, called the antecedent and the consequent, respectively. For example, typical mined rules include {"butter"} ⇒ {"jam"} and {"ham", "bread"} ⇒ {"cheese"}.

The original association rule mining method is modified here to tolerate the incomplete label information in the training data. Given a label set $\{y_i\}_{i=1}^{K}$ and a dataset $L = \{(x_1, Y_1), \ldots, (x_N, Y_N)\}$, where $Y$ indicates a subset of $\{y_i\}_{i=1}^{K}$, we treat $\{y_i\}_{i=1}^{K}$ as an item set. Each example's labels constitute a transaction, so the training dataset can be transformed into a transaction set $T = \{Y_1, \ldots, Y_N\}$. However, the training dataset $L$ may contain only partial label information for some of its examples due to the incomplete annotation on $L$, and correspondingly some informative label correlations may be omitted in $T$. In order to compensate for the losses, we perform a label enrichment process on the training dataset $L$ before it is transformed into a transaction set. For each example in $L$, its $J$ nearest neighbors are obtained based on their visual feature descriptions, and the neighbors' labels are assigned to it to compensate for the missing labels. The label-enriched $L$ is transformed into the transaction set $T$. Association rule mining is then conducted on $T$ based on two measures, namely support and confidence:

$$support(Y_A \Rightarrow Y_B) = support(Y_A \cup Y_B), \tag{11}$$

$$confidence(Y_A \Rightarrow Y_B) = \frac{support(Y_A \cup Y_B)}{support(Y_A)}. \tag{12}$$

$Y_A$ and $Y_B$ indicate subsets of the label set, namely compositional labels. Unlike traditional association rule mining, here we restrict the consequent $Y_B$ to a compositional label composed of a single primary label, while $Y_A$ can be a compositional label of any order. The support of $Y_A$ is defined as the proportion of the transactions in $T$ which contain $Y_A$:

$$support(Y_A) = \frac{|\{Y_i \mid Y_i \in T, Y_A \subseteq Y_i\}|}{|T|}. \tag{13}$$

Minimum thresholds for both support and confidence are set to control the mining process: the higher the thresholds, the fewer rules are mined and the more informative the mined rules are. As the output of the mining procedure, a set of informative association rules $R$ is obtained. Then, for a specific label $y_s$, all the compositional labels which can imply $y_s$ according to the rules in $R$ form the correlated label set $c(y_s)$ for $y_s$.
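The support and confidence computations of Eqs. 11-13 can be illustrated on toy label transactions as follows. This is a brute-force enumeration for clarity rather than an optimized miner (e.g., Apriori-style pruning), and the transactions, thresholds, and names are our own examples.

```python
from itertools import combinations

# Each transaction is the (possibly enriched) label set of one training example.
T = [{"beach", "ocean", "sky"},
     {"beach", "ocean"},
     {"ocean", "boat"},
     {"beach", "ocean", "sand"}]

def support(itemset):
    """Eq. 13: fraction of transactions that contain the compositional label."""
    return sum(itemset <= t for t in T) / len(T)

def mine_rules(labels, max_order=3, min_sup=0.5, min_conf=0.8):
    """Rules Y_A => {y_b} with a single primary label as consequent (Eqs. 11-12)."""
    rules = []
    for order in range(1, max_order + 1):
        for A in combinations(labels, order):
            A = frozenset(A)
            for b in labels - A:
                sup = support(A | {b})                                # Eq. 11
                conf = sup / support(A) if support(A) > 0 else 0.0    # Eq. 12
                if sup >= min_sup and conf >= min_conf:
                    rules.append((set(A), b, sup, conf))
    return rules

labels = {l for t in T for l in t}
for A, b, sup, conf in mine_rules(labels):
    print(A, "=>", b, f"support={sup:.2f} confidence={conf:.2f}")
# c(y_s) then collects every antecedent A whose mined rule implies y_s.
```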

D. Optimization

Given the score function defined by Eq. 10, the optimal example-label pair selection $S^*$ should have the highest score. However, the true labels of the selected example-label pairs are unknown when the score is calculated. One common solution is to use the current models' predictions as label estimations, but such a solution might aggravate the current models' biases, since the models are obtained on a small labeled dataset. Here we adopt the optimistic strategy proposed in [12]. In binary classification settings, it assigns an unlabeled example the label (positive or negative) that is the most informative to the current models. In our case, the optimistic labels $Y^*$ for a selection $S$ are the labels that generate the maximal $f(S)$ score for $S$. So the optimal selection $S^*$ and its optimistic labels $Y^*$ are found simultaneously by maximizing the score function defined by Eq. 10:

$$(S^*, Y^*) = \arg\max_{S \subseteq U^{t-1},\, Y} f(S). \tag{14}$$

In order to solve the optimization problem, a set of auxiliary pair selection variables is introduced. Let $p_k \in U^{t-1}$ ($1 \leq k \leq u$, with $u = |U^{t-1}|$) represent an example-label pair in the unlabeled example-label pair set $U^{t-1}$, where $u$ indicates the number of pairs in $U^{t-1}$, and let $p_k(x)$ and $p_k(y)$ denote the example part and the label part of the pair $p_k$, respectively. Firstly, each unlabeled example-label pair $p_k$ in $U^{t-1}$ is expanded into two labeled example-label pairs carrying the positive and negative labels:

$$p_{k+u}(x) = p_k(x), \quad \text{for } k \in [1, u],$$
$$p_k(y) = +1, \quad \text{for } k \in [1, u],$$
$$p_k(y) = -1, \quad \text{for } k \in [u+1, 2u]. \tag{15}$$

Then the auxiliary pair selection variables can be denoted as a vector $q \in \{0, 1\}^{2u}$, where $q_k = 1$ ($1 \leq k \leq 2u$) means that the corresponding expanded example-label pair $p_k$ is selected for annotation, and $q_k = 0$ otherwise. Hereafter, $\tilde{U}^{t-1}$ represents the expanded unlabeled example-label pair set. Now the optimal pair selection problem can be formulated as the following optimization problem over the pair selection variables $q$:

$$\arg\max_{q} \sum_{(x_i, y_r) \in L^{t-1}} \log P(y_r|x_i, w_r^t) + \beta \sum_{p_k \in \tilde{U}^{t-1}} q_k \log P(p_k(y)|p_k(x), w_{p_k(y)}^t) - \alpha \sum_{p_k \in \tilde{U}^{t-1}} \frac{1 - q_k}{C_{p_k(y)}} \sum_{Y_t \in c(p_k(y))} H(P_{p_k(y)}, P_{Y_t}) \tag{16}$$

$$\text{s.t.} \quad \sum_{k=1}^{2u} q_k \leq m, \tag{17}$$

$$q_k + q_{k+u} \leq 1 \quad (1 \leq k \leq u), \tag{18}$$

$$q_k \in \{0, 1\} \quad (1 \leq k \leq 2u). \tag{19}$$

Note that $(S^*, Y^*)$ is encoded in the selection variables $q$: $q$ not only selects example-label pairs from $U^{t-1}$, but also guesses their optimistic labels. The second term in Eq. 16 represents the log likelihood of the selected pairs, and $\beta$ is a parameter used to control the belief in the optimistic labels. Note also that $w^t$ depends on $q$. Because of the constraint defined by Eq. 19, the optimization problem defined by Eq. 16 is an integer programming problem, which is NP-hard. Therefore, in order to solve it in practice, we first relax the constraint Eq. 19 to

$$0 \leq q_k \leq 1 \quad (1 \leq k \leq 2u). \tag{20}$$



Then the problem defined by Eqs. 16-19 is relaxed to a continuous optimization problem defined by Eqs. 16-18 and Eq. 20. With the optimal $q$ obtained by solving this continuous optimization problem, a greedy strategy that iteratively sets the largest $q_k$ to 1 subject to the constraints can be adopted to obtain a solution to the original integer problem.

The objective function Eq. 16 of the relaxed problem is not a concave function of $q$, so convex optimization methods cannot be applied to obtain globally optimal solutions. In practice, however, standard optimization techniques can be adopted to find a locally optimal solution. Here, we adopt the quasi-Newton method. Firstly, the objective function Eq. 16 is treated as a function of $q$: $f(q)$. Given an initial assignment $q_0$, the quasi-Newton method iteratively updates $q$ based on local gradients to maximally increase the value of $f(q)$ until a local maximum is reached. Let $q_l$ denote the current assignment of $q$ at iteration $l$. The quasi-Newton method first calculates the second-order Taylor approximation $\tilde{f}(q)$ of $f(q)$ at $q_l$:

$$\tilde{f}(q) = f(q_l) + \nabla f(q_l)(q - q_l) + \frac{1}{2}(q - q_l)^T B_l (q - q_l), \tag{21}$$

where $\nabla f(q_l)$ and $B_l$ indicate the gradient vector and the Hessian matrix of $f(q)$ at the point $q_l$, respectively. The optimal update direction $d_l$ at iteration $l$ can be obtained by solving the quadratic programming problem defined by Eq. 21, Eq. 17, Eq. 18, and Eq. 20. Let $q_l^*$ denote the solution of the quadratic programming problem; then $d_l = q_l^* - q_l$. With this update, a backtracking line search is applied to ensure improvement of the objective function $f(q)$. It is worth noting that for each $q$, $w^t$ needs to be retrained on $L^{t-1} \cup S$ to calculate the new objective value. In practice, we approximate the training of $w^t$ by constraining it to several Newton steps with a starting point given by $w^{t-1}$, which reduces the computational cost.
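The final rounding step described above, which sets the largest relaxed $q_k$ to 1 while respecting the budget constraint (17) and the exclusivity constraint (18), can be sketched as follows. This is our own helper, assuming a relaxed solution $q$ is already available; it is not the paper's implementation.

```python
import numpy as np

def greedy_round(q_relaxed, m):
    """Greedily binarize the relaxed selection vector q in [0,1]^{2u}.

    Respects constraint (17): at most m pairs selected, and constraint (18):
    an expanded pair and its opposite-label twin (indices k and k+u) cannot
    both be selected.
    """
    q = np.asarray(q_relaxed, dtype=float)
    u = len(q) // 2
    selected = np.zeros(2 * u, dtype=int)
    for k in np.argsort(-q):                      # largest relaxed value first
        if selected.sum() >= m:                   # budget (17) exhausted
            break
        twin = k + u if k < u else k - u
        if selected[twin] == 0:                   # exclusivity (18)
            selected[k] = 1
    return selected

# Example: u = 3 pairs expanded to 6 (pair, guessed-label) candidates, budget m = 2
print(greedy_round([0.9, 0.2, 0.4, 0.1, 0.8, 0.3], m=2))   # -> [1 0 0 0 1 0]
```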

E. Parameter Adjustment

The trade-off parameter α and the minimum support and minimum confidence for association rule mining can be obtained by cross validation on the training dataset. As for the parameter β, it reflects the belief in the guessed optimistic labels, and we use it to control the contribution of the optimistic labels to the objective function defined by Eq. 16. When the labeled dataset is small and the partition information in the unlabeled dataset disagrees with the true classification, it is likely that the guessed optimistic labels do not reflect the true labels. In such a case, the selected example-label pairs are not able to maximize the improvement to the current models, so β is used to reduce our belief in the guessed optimistic labels and their contribution to the objective function. Namely, after each annotation round, if the true labels of the selected example-label pairs differ from the guessed labels, β is reduced by a small factor, e.g., 0.5. In contrast, if the guessed optimistic labels and the true labels are consistent, we keep the belief in them and their contribution to the objective function, and β remains 1.
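The β adjustment just described amounts to a one-line rule; the sketch below is illustrative only, with 0.5 used as the example shrink factor mentioned in the text.

```python
def update_beta(beta, guessed_labels, true_labels, factor=0.5):
    """Shrink the belief parameter when the optimistic guesses were wrong,
    otherwise restore full belief (beta = 1)."""
    return beta * factor if guessed_labels != true_labels else 1.0
```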

IV. EXPERIMENT

In this section, we conduct experiments on four different datasets to verify the effectiveness of the proposed method.

A. Multi-Label Image Classification on Scene Dataset

The first experiment is conducted on the scene dataset,3 which contains natural scene images with multiple labels. The images are collected from the COREL image collection and the Internet. Sample images are shown in Fig. 3. The dataset was originally used in [43]. There are 5 class labels: desert, mountains, sea, sunset, and trees. Adopting the method developed in [44], each image is depicted by a bag of nine 15-dimensional feature vectors. A subset of the whole dataset is used; it contains all the 457 multi-label images and 250 single-label images. The statistics for the different labels can be found in Table I.

Fig. 3. Some sample images from the multi-label natural scene dataset.

TABLE I. Statistics on the scene dataset (D: Desert, M: Mountains, S: Sea, Su: Sunset, T: Trees).

The comparison study is conducted with 5 different approaches. (1) Passive selection, which randomly selects a batch of example-label pairs for each class. Binary batch mode active learning methods: (2) Discriminative Batch Mode Active Learning (DBAL) [12] and (3) Far-sighted Active Learning (FAL) [14]; for these two approaches, example-label pairs are evenly selected across classes. Multi-label myopic active learning methods: (4) Two-dimensional Active Learning (TDAL) [22] and (5) Multi-task Active Learning (MTAL) [23]; batch selection is achieved by sequentially selecting the m example-label pairs with the highest informativeness scores. Pair-wise indicates the proposed method with only pair-wise label correlations utilized, and HoAL indicates that high order label correlations are also used. For the proposed method, a multiple-instance multiple-label SVM [43] is used to generate the prediction models.

3 http://lamda.nju.edu.cn/data_MIMLimage.ashx



TABLE II. Performances on the scene dataset based on five different multi-label evaluation metrics.

Specifically, images are depicted by adopting the multi-instance learning framework [45]-[47]. Each image is represented as a collection of nine instances (image patches) following the work in [44]. Then, as described in [43], a multi-instance multi-label SVM is used to generate the classification models: the constructive clustering method [48] is first applied to convert multi-instance examples into standard single-instance examples, and a multi-label SVM [49] is then used to generate the multi-label classifiers.

All the methods perform 3 iterations of learning (initially 400 example-label pairs, with each iteration querying 100 more example-label pairs until 700 example-label pairs are reached). The final results are averaged over 5 random runs. Five different evaluation metrics are used, including Hamming loss, one-error, coverage, ranking loss, and average precision, as in [24], [43]. The symbol ↑ in the table means that a higher value of the evaluation metric indicates better performance, and the symbol ↓ works the other way around. The final results are listed in Table II, with the best results shown in bold.

From Table II, we can see that the proposed methods consistently outperform the other techniques on all the evaluation metrics for all three learning iterations. Although the original optimization problem with the constraint defined by Eq. 19 is an NP-hard integer programming problem, the locally optimal solution obtained by solving the relaxed continuous optimization problem is still very effective.
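For reference, Hamming loss, one of the five metrics reported in Table II, counts the fraction of example-label entries that are predicted incorrectly. The sketch below uses the standard definition (as in [24]) and is our own illustration, not code from the paper.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of example-label entries predicted incorrectly (lower is better)."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    return float(np.mean(Y_true != Y_pred))

# Toy check: 2 examples x 5 labels, 3 wrong entries out of 10 -> 0.3
Y_true = np.array([[ 1, -1,  1, -1,  1],
                   [ 1,  1, -1, -1, -1]])
Y_pred = np.array([[ 1, -1, -1, -1,  1],
                   [-1,  1, -1,  1, -1]])
print(hamming_loss(Y_true, Y_pred))   # 0.3
```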

The experimental results also demonstrate the following: (1) Although it is possible for the proposed method to make incorrect label guesses because of misleading information presented by the unlabeled data, such situations can be identified immediately after annotation and rectified by adjusting the belief parameter β. (2) The fact that HoAL consistently outperforms Pair-wise verifies that wise utilization of high order label correlations helps achieve more efficient and accurate multi-label image classification.

B. Web Information Classification

In this section, we conduct an experiment on the dataset from the CMU Reading the Web project.4 It contains 708 classes, and each example is represented by a 99,400-dimensional feature vector. We select the 10 most frequent classes for the experiment. One third of the examples are used as testing examples, 30 examples are used to initialize the prediction models, and 5 random runs are performed. The proposed method is compared with the five different approaches described in the first experiment. Label correlations are mined from the CMU Reading the Web dataset. The average area under the ROC curve (AUC) is calculated over all the classes and plotted against the number of queried example-label pairs on the x-axis. The final results are shown in Fig. 4. The efficiencies of the different methods can be measured by the slopes of the curves.

4 http://rtw.ml.cmu.edu/rtw/



Fig. 4. The learning curves for the CMU Reading the Web dataset.

We can draw four conclusions from these results. First, the active learning methods outperform the passive learning method. Although random selection can steadily improve the performance, it is not as efficient as the other active learning approaches, which show considerably quicker gains. Second, the batch mode active learning methods achieve faster gains than the multi-label myopic active learning approaches. This implies that batch mode methods take into account the information overlap among the selected example-label pairs, which is the dominant factor influencing the learning efficiency. Third, the proposed method achieves the best performance. The reasons are twofold: (a) the proposed method formulates example-label pair selection as an optimization problem over the auxiliary pair selection variables, which prevents information redundancy in the selected example-label pairs; (b) the proposed method takes into account the label correlations, and wisely exploiting the redundant information among different labels further improves the learning efficiency. Fourth, the high order label correlation driven approach outperforms the pair-wise label correlation based approach.

From the two experiments above (Sections IV-A and IV-B), we can see that the proposed method consistently outperforms TDAL [22], which also considers pair-wise label correlations. Traditional myopic active learning, to which TDAL belongs, queries one example-label pair at a time. For such a method to achieve batch mode active learning, it simply selects the batch of example-label pairs with the highest scores at a time. However, by doing so, it does not take into account the information overlap between the selected pairs, which leads to suboptimal queries. The proposed method is designed particularly to tackle this difficulty: it averts information redundancy by considering not only the uncertainty but also the diversity of the selection. This is the main reason why TDAL underperforms the proposed method, even though it also considers pair-wise label correlations.

C. Document Categorization

The third experiment is performed to show the impact of high order label correlations on MLBAL. It is conducted

Fig. 5. The learning curves for the RCV1 dataset.

TABLE III. The distribution of the order of the mined label correlations.

on data collected from the RCV1 corpus [50], a benchmark dataset containing 800,000 newswire stories from Reuters. Each article in the dataset is represented by a 47,236-dimensional TF-IDF weight vector, and there are 101 topics in total. In our experiment, only a subset of this dataset is used.5 It contains 3000 training examples and 3000 testing examples, and we select the 36 most frequent classes for the experiment. Initially, we randomly select 60 labels from the training set for each class to obtain the initial models. Label correlations are mined from the RCV1 corpus. The distribution of the order of the mined correlations is shown in Table III. As we can see, nearly half of the mined correlations are pair-wise correlations, and only a small portion of the correlations have order larger than 5.

In order to show the impact of high order label correlations on multi-label active learning, the proposed approach is run with 3 different settings: (1) pair-wise, in which only pair-wise label correlations are used; (2) up to order 3, in which both pair-wise and order-3 correlations are utilized; and (3) HoAL, which uses all the mined correlations. As in the second experiment, the proposed approach is run 5 times on the dataset for each of the 3 settings, and the average area under the ROC curve (AUC) is computed over all the classes for evaluation. The performances of the 5 approaches used in the previous experiments, namely Passive, DBAL, FAL, TDAL, and MTAL, are also evaluated for comparison. The final comparison results are shown in Fig. 5, in which the virtue of utilizing high order label correlations manifests itself: the higher the order of the correlations utilized, the faster the learning process converges and the fewer annotations are required for accurate classification.

5 http://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/datasets/



Fig. 6. Average results on NUS-WIDE and NUS-WIDE-Lite. (a) Average precision for NUS-WIDE with different batch selection sizes. (b) Average precision for NUS-WIDE-Lite with different batch selection sizes. (c) Average recall for NUS-WIDE with different batch selection sizes. (d) Average recall for NUS-WIDE-Lite with different batch selection sizes.

D. CBIR on NUS-WIDE Dataset

The last experiment is designed to investigate the effect of the batch selection size and the effect of high order label correlations on MLBAL. We conduct content based image retrieval (CBIR) experiments on the NUS-WIDE dataset6 [51]. It contains 269,648 images crawled from the popular photo sharing website Flickr,7 together with 5,018 tags created by Flickr users for these images. The dataset creators extracted 81 concepts from the 5,018 tags for CBIR algorithm evaluation. In addition, a subset of NUS-WIDE (NUS-WIDE-Lite) was extracted by removing noisy tags. Our experiments are conducted on both NUS-WIDE and NUS-WIDE-Lite. Images are represented by different types of visual features, including a 64-dimensional color histogram, a 73-dimensional edge direction histogram, a 128-dimensional wavelet texture, and 225-dimensional block-wise color moments. One iteration of active learning is conducted with different batch sizes, and the obtained classifiers are used to perform CBIR.

6 http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm
7 http://www.flickr.com/

For NUS-WIDE, each image is associated with an 81-dimensional label vector indicating whether it belongs to each of the 81 concepts. An initial labeled example-label pair set (32,000 pairs) is randomly selected and provided for training the initial prediction models. The batch selection size varies from 64,000 pairs to 224,000 pairs with a step size of 32,000 pairs, and the rest are used as testing examples. NUS-WIDE-Lite contains 55,615 images randomly chosen from NUS-WIDE, and similar to NUS-WIDE, each image in NUS-WIDE-Lite is associated with an 81-dimensional label vector. Initially 5,000 pairs are randomly selected for training

initial classifiers; then experiments are conducted with different batch sizes from 10,000 pairs to 35,000 pairs with a step size of 5,000 pairs.

As before, we compare the proposed approach with four different approaches: Discriminative Batch Mode Active Learning (DBAL), Far-sighted Active Learning (FAL), Two-dimensional Active Learning (TDAL), and Multi-task Active Learning (MTAL). A multi-label SVM is used to generate the prediction models for the proposed method. Informative label correlations are mined on NUS-WIDE as prior knowledge. The proposed method is run in two versions: the first performs learning with pair-wise label correlations only, and the second performs learning with additional high order label correlations.

For the full NUS-WIDE dataset with 269,648 images, it takes an average of 28 minutes, with a variance of 5.1 minutes, for the proposed method (with high order label correlations considered) to find the optimal example-label pairs for annotation. The computation time is obtained by averaging the times for different batch sizes, varying from 64,000 pairs to 224,000 pairs with a step size of 32,000 pairs. Similarly, for NUS-WIDE-Lite, the average computation time is 4.9 minutes with a variance of 1.2 minutes. The results are obtained on an Intel Core i7 machine with a 3.4 GHz CPU and 16 GB RAM. Matlab is used for the implementation, and the performance is obtained without code optimization.

The criteria we adopt for performance evaluation are the average precision and average recall over all 81 labels. The averaged results are shown in Fig. 6. As we can see, the proposed method HoAL performs the best, and as the batch size increases, the gains become more apparent. The reason is that large batch selection is vulnerable to




R EFERENCES information redundancy. And learning efficiency tends to reduce when selection size increases. But the proposed method directly chooses the pairs that can maximize learning objective, e.g., the score function defined by Eq. 10. Besides, it takes into account the label correlations to improve the learning efficiency. Therefore, the proposed method can suppress the negative effect of large selection size and keep the learning process efficient. Additionally, from the results we can see that the proposed method achieves more accurate performances when high order label correlations are utilized. This verifies our concerns about the usefulness of high order label correlations for multi-label batch mode active learning. V. C ONCLUSION Motivated by the virtue of leveraging label correlations to improve multi-label classification, a novel multi-label batch mode active learning approach, high order label correlation driven active learning (HoAL), is presented in this paper for multi-label image classification. It is verified to have significant improvement over the state-of-the-art active learning techniques. The main features of HoAL include: (1) Active selection is performed on example-label level rather than example level. (2) Informative label correlation is discovered by utilizing an efficient association rule mining algorithm. (3) The cross-label uncertainty on unlabeled data is defined based on KL divergence. Both single-label uncertainty and cross-label uncertainty are unified by the cross entropy measure. (4) The informative example-label pair selection is formulated as a continuous optimization problem over selection variables with the consideration of label correlations. With multi-label and batch mode characteristics considered and ingeniously handled, active learning turns practical for large scale real-world image classification. As illustrated in Fig. 2, with the proposed HoAL, parallel annotation systems, e.g., Mechanical Turk and LabelMe, can be integrated into large scale real-world image classification and retrieval system, e.g., Internet image search engine. Such practical active learning based framework provides us the flexibility to control the balance between classification accuracy and annotation cost. Additionally, the quality control for crowdsourcing is another important factor for making the application of large scale active learning realistic. It has been attracting increasing attention [52]–[54]. As shown in Fig. 7, on one hand, human annotation is able to provide the most accurate image classification, but incurs the highest cost. On the other hand, passive learning, which learns image classifiers only from the provided small amount

REFERENCES

[1] D. Angluin, "Queries and concept learning," Mach. Learn., vol. 2, no. 4, pp. 319–342, 1988.
[2] B. Settles, "Active learning literature survey," Dept. Comput. Sci., Univ. Wisconsin–Madison, Madison, WI, USA, Tech. Rep. 1648, 2009.
[3] A. Sorokin and D. Forsyth, "Utility data annotation with Amazon Mechanical Turk," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2008, pp. 1–8.
[4] B. Russell, A. Torralba, K. Murphy, and W. Freeman, "LabelMe: A database and web-based tool for image annotation," Int. J. Comput. Vis., vol. 77, no. 1, pp. 157–173, 2008.
[5] S. Kullback and R. Leibler, "On information and sufficiency," Ann. Math. Stat., vol. 22, no. 1, pp. 79–86, 1951.
[6] R. Agrawal, T. Imieliński, and A. Swami, "Mining association rules between sets of items in large databases," ACM SIGMOD Rec., vol. 22, no. 2, pp. 207–216, 1993.
[7] C. Campbell, N. Cristianini, and A. Smola, "Query learning with large margin classifiers," in Proc. 17th Int. Conf. Mach. Learn., 2000, pp. 111–118.
[8] S. Tong and D. Koller, "Support vector machine active learning with applications to text classification," J. Mach. Learn. Res., vol. 2, pp. 45–66, Mar. 2002.
[9] Y. Freund, H. Seung, E. Shamir, and N. Tishby, "Selective sampling using the query by committee algorithm," Mach. Learn., vol. 28, nos. 2–3, pp. 133–168, 1997.
[10] R. Herbrich, N. D. Lawrence, and M. Seeger, "Fast sparse Gaussian process methods: The informative vector machine," in Proc. Adv. Neural Inf. Process. Syst., 2002, pp. 609–616.
[11] K. Brinker, "Incorporating diversity in active learning with support vector machines," in Proc. 20th Int. Conf. Mach. Learn., vol. 3, 2003, pp. 59–66.
[12] Y. Guo and D. Schuurmans, "Discriminative batch mode active learning," in Proc. Adv. Neural Inf. Process. Syst., 2007, pp. 593–600.
[13] S. C. Hoi, R. Jin, J. Zhu, and M. R. Lyu, "Semi-supervised SVM batch mode active learning for image retrieval," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–7.
[14] S. Vijayanarasimhan, P. Jain, and K. Grauman, "Far-sighted active learning on a budget for image and video recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3035–3042.
[15] A. Krause and C. Guestrin, "Nonmyopic active learning of Gaussian processes: An exploration-exploitation approach," in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 449–456.
[16] S.-J. Huang, R. Jin, and Z.-H. Zhou, "Active learning by querying informative and representative examples," in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 892–900.
[17] Y. Guo, "Active instance sampling via matrix partition," in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 802–810.
[18] G. Schohn and D. Cohn, "Less is more: Active learning with support vector machines," in Proc. 17th Int. Conf. Mach. Learn., 2000, pp. 839–846.
[19] Z. Xu, K. Yu, V. Tresp, X. Xu, and J. Wang, "Representative sampling for text classification using support vector machines," in Proc. Adv. Inf. Retr., vol. 3, 2003, pp. 393–407.
[20] S. C. Hoi, R. Jin, and M. R. Lyu, "Large-scale text categorization by batch mode active learning," in Proc. 15th Int. Conf. World Wide Web, 2006, pp. 633–642.
[21] S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu, "Batch mode active learning and its application to medical image classification," in Proc. 23rd ICML, 2006, pp. 417–424.
[22] G. Qi, X. Hua, Y. Rui, J. Tang, and H. Zhang, "Two-dimensional active learning for image classification," in Proc. IEEE Conf. CVPR, Jun. 2008, pp. 1–8.
[23] Y. Zhang, "Multi-task active learning with output constraints," in Proc. 24th AAAI Conf. Artif. Intell., 2010, pp. 1–6.
[24] R. Schapire and Y. Singer, "BoosTexter: A boosting-based system for text categorization," Mach. Learn., vol. 39, nos. 2–3, pp. 135–168, 2000.


[25] A. Elisseeff and J. Weston, "Kernel methods for multi-labelled classification and categorical regression problems," in Adv. Neural Inf. Process. Syst., vol. 14, 2002, pp. 681–687.
[26] K. Crammer and Y. Singer, "A new family of online algorithms for category ranking," in Proc. 25th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., 2002, pp. 151–158.
[27] S. Zhu, X. Ji, W. Xu, and Y. Gong, "Multi-labelled classification using maximum entropy method," in Proc. 28th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., 2005, pp. 274–281.
[28] G. Qi, X. Hua, Y. Rui, J. Tang, T. Mei, and H. Zhang, "Correlative multi-label video annotation," in Proc. 15th Int. Conf. Multimedia, 2007, pp. 17–26.
[29] Y. Sun, Y. Zhang, and Z. Zhou, "Multi-label learning with weak label," in Proc. 24th AAAI Conf. Artif. Intell., 2010, pp. 593–598.
[30] B. Hariharan, L. Zelnik-Manor, S. Vishwanathan, and M. Varma, "Large scale max-margin multi-label classification with priors," in Proc. 27th Int. Conf. Mach. Learn., 2010, pp. 1–8.
[31] R. Yan, J. Tesic, and J. Smith, "Model-shared subspace boosting for multi-label classification," in Proc. 13th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2007, pp. 834–843.
[32] S. Godbole and S. Sarawagi, "Discriminative methods for multi-labeled classification," in Proc. Adv. Knowl. Discovery Data Mining, 2004, pp. 22–30.
[33] B. Zhang, Y. Wang, and W. Wang, "Batch mode active learning for multi-label image classification with informative label correlation mining," in Proc. IEEE WACV, Jan. 2012, pp. 401–407.
[34] J. Shore and R. Johnson, "Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy," IEEE Trans. Inf. Theory, vol. 26, no. 1, pp. 26–37, Jan. 1980.
[35] J. Goldberger, S. Gordon, and H. Greenspan, "An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures," in Proc. 9th IEEE Int. Conf. Comput. Vis., Oct. 2003, pp. 487–493.
[36] T. Hofmann, "Probabilistic latent semantic indexing," in Proc. 22nd Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., 1999, pp. 50–57.
[37] S. Tong and D. Koller, "Active learning for parameter estimation in Bayesian networks," in Proc. NIPS, vol. 13, 2000, pp. 647–653.
[38] A. McCallum and K. Nigam, "Employing EM and pool-based active learning for text classification," in Proc. ICML, vol. 98, 1998, pp. 350–358.
[39] Z. Xu, R. Akella, and Y. Zhang, "Incorporating diversity and density in active learning for relevance feedback," in Proc. Adv. Inf. Retr., 2007, pp. 246–257.
[40] B. Zhang, G. Ye, Y. Wang, W. Wang, J. Xu, G. Herman, et al., "Multiclass graph boosting with subgraph sharing for object recognition," in Proc. 20th Int. Conf. Pattern Recognit., 2010, pp. 1541–1544.
[41] B. Zhang, G. Ye, Y. Wang, J. Xu, and G. Herman, "Finding shareable informative patterns and optimal coding matrix for multiclass boosting," in Proc. IEEE 12th ICCV, Sep./Oct. 2009, pp. 56–63.
[42] M. Yang, Y. Wu, and S. Lao, "Intelligent collaborative tracking by mining auxiliary objects," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1, Jun. 2006, pp. 697–704.
[43] Z. Zhou and M. Zhang, "Multi-instance multi-label learning with application to scene classification," in Proc. 20th Annu. Conf. NIPS, 2006, pp. 1–8.
[44] O. Maron and A. Ratan, "Multiple-instance learning for natural scene classification," in Proc. 15th ICML, 1998, pp. 341–349.
[45] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, "Solving the multiple instance problem with axis-parallel rectangles," Artif. Intell., vol. 89, no. 1, pp. 31–71, 1997.
[46] G. Herman, G. Ye, J. Xu, and B. Zhang, "Region-based image categorization with reduced feature set," in Proc. IEEE 10th Workshop Multimedia Signal Process., Oct. 2008, pp. 586–591.
[47] B. Zhang, Y. Wang, and W. Wang, "Multiple-instance learning from multiple perspectives: Combining models for multiple-instance learning," in Proc. IEEE WACV, Jan. 2012, pp. 481–487.


[48] Z.-H. Zhou and M.-L. Zhang, "Solving multi-instance problems with classifier ensemble based on constructive clustering," Knowl. Inf. Syst., vol. 11, no. 2, pp. 155–170, 2007.
[49] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, "Learning multi-label scene classification," Pattern Recognit., vol. 37, no. 9, pp. 1757–1771, 2004.
[50] D. Lewis, Y. Yang, T. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," J. Mach. Learn. Res., vol. 5, pp. 361–397, Dec. 2004.
[51] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, "NUS-WIDE: A real-world web image database from National University of Singapore," in Proc. ACM Int. Conf. Image Video Retr., 2009, p. 48.
[52] O. Alonso, D. E. Rose, and B. Stewart, "Crowdsourcing for relevance evaluation," ACM SIGIR Forum, vol. 42, no. 2, pp. 9–15, 2008.
[53] M. Lease, "On quality control and machine learning in crowdsourcing," in Proc. Human Comput., 2011, pp. 97–102.
[54] M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, E. Bertino, and S. Dustdar, "Quality control in crowdsourcing systems: Issues and directions," IEEE Internet Comput., vol. 17, no. 2, pp. 76–81, Mar./Apr. 2013.

Bang Zhang is a Research Engineer with the Machine Learning Research Group, National ICT Australia. He received the B.S. degree in computer science from Sun Yat-sen University, China, in 2004, and the M.S. and Ph.D. degrees in computer science from the University of New South Wales, Australia, in 2006 and 2013, respectively. His research interests include pattern recognition, machine learning, image processing, and computer vision.

Yang Wang is a Senior Researcher with the Machine Learning Research Group, National ICT Australia (NICTA). He received the Ph.D. degree in computer science from the National University of Singapore in 2004. Before joining NICTA in 2006, he was with the Institute for Infocomm Research, Rensselaer Polytechnic Institute, and Nanyang Technological University. He has published more than 60 international conference and journal papers on pattern classification and computer vision. His research interests include machine learning and information fusion techniques and their applications to intelligent infrastructure, cognitive and emotive computing, and image and video analysis.

Fang Chen received the Ph.D. degree in communications and electronic systems from Beijing Jiaotong University, China. She is currently a Senior Principal Researcher with National ICT Australia, Sydney, and a Conjoint Professor with the University of New South Wales. Her main research interests are human-machine interaction, especially multimodal systems, cognitive load modeling, speech processing, natural language processing, user interface design, and evaluation.
