
Query-Adaptive Multiple Instance Learning for Video Instance Retrieval Ting-Chu Lin, Min-Chun Yang, Chia-Yin Tsai, and Yu-Chiang Frank Wang

Abstract—Given a query image containing the object of interest (OOI), we propose a novel learning framework for retrieving relevant frames from an input video sequence. While techniques based on object matching have been applied to this task, their performance is typically limited by the lack of capability to handle variations in the visual appearance of the OOI across video frames. Our proposed framework can be viewed as a weakly supervised approach, which only requires a small number of (randomly selected) relevant and irrelevant frames from the input video to achieve satisfactory retrieval performance. By utilizing the frame-level label information of such video frames together with the query image, we propose a novel query-adaptive Multiple Instance Learning (q-MIL) algorithm, which exploits the visual appearance information of the OOI from both the query and the aforementioned video frames. As a result, the derived learning model exhibits additional discriminating ability when retrieving relevant instances. Experiments on two real-world video datasets confirm the effectiveness and robustness of our proposed approach.

Index Terms—Object detection, object matching, multiple instance learning, weakly supervised learning

This work is supported in part by the Industrial Technology Research Institute (ITRI) Advanced Research Program via 102-EC-17-A-01-05-0337, and the National Science Council via NSC102-3111-Y-001-015 and NSC102-2221-E-001-005-MY2.
Ting-Chu Lin was with the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan, and is currently with the Department of Computer Science at Columbia University, New York, USA (email: [email protected]).
Min-Chun Yang is with the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan (email: [email protected]).
Chia-Yin Tsai was with the Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, and is currently with the Department of Electrical and Computer Engineering at Carnegie Mellon University, Pittsburgh, USA (email: [email protected]).
Yu-Chiang Frank Wang is with the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan (email: [email protected]).

I. INTRODUCTION

The amount of online video has been exploding over the past decade due to the development of internet and multimedia technologies. Take YouTube [1] for example: over 72 hours of video are uploaded every minute. However, video search and retrieval mostly rely on the tags annotated by users, which might not always be available or relevant to the video content. Thus, it remains a very challenging task to search for specific video segments containing objects of interest (OOI), which is crucial for several video-based applications including embedded marketing [2], [3] (see Figure 1 for example). In practice, the content providers might not always know exactly when (and how often) particular items are embedded in their videos. Such information, even if

Fig. 1. Challenges in practical video instance search problems (object of interest and relevant video frames): (a) The red bounding boxes denote the ground-truth regions of the object of interest (OOI), while the yellow ones are the incorrectly detected ones using weakly supervised learning methods (e.g., [4]). (b) Significant appearance variations of the OOI are presented, and thus standard image matching approaches cannot perform successful detection. Note that all commercial logos in this paper are blocked by green rectangles in accordance with copyright regulations.

available in the scripts, might not be successfully transferred to TV stations, cable companies, or online video platforms which broadcast/deliver the videos. For old movies or TV series, the aforementioned information is obviously not available either. Without the ability to automatically search for the OOI across video frames based on the query input, it will be challenging for advertisers or video deliverers to design and provide interactive services in future smart digital TV applications.

In order to solve the above tasks, methods based on weakly supervised learning [5], [4], [6], [7], [8], [9], [10], [11] have been applied. Weakly supervised object detection or localization aims at designing object detectors using only label information at the image/frame level. In other words, instead of collecting ground-truth pixel-level labels for the OOI, one only acquires video frames with and without the particular object presented for learning the classifiers or detectors. Once these classifiers or detectors are obtained, they can be applied to recognize/match similar objects across frames in query videos. For such weakly supervised learning settings, one typically assumes that the regions of interest share similar visual features across training video frames containing the same OOI [5], [12]. However, due to the lack of ground-truth information of the associated features at the pixel level, the above assumption might fail due to noise, cluttered backgrounds, or other objects presented (see examples shown in Figure 1(a)).

On the other hand, if training data can be collected in advance, one can apply matching-based approaches for solving


the problems of video instance search [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23]. With the use of training data, such detection or matching-based methods are generally expected to perform better than those with weakly supervised strategies. Unfortunately, when matching or searching for similar object instances in test videos, one might observe significant visual appearance variations of the OOI (e.g., Fig. 1(b)), which cannot be well described by the training data. Moreover, it is typically not possible to have the prior knowledge required for acquiring a sufficient amount of training data, and thus the above techniques might not be applicable.

In this paper, we propose a novel learning framework based on multiple instance learning (MIL), which integrates the aforementioned weakly supervised and object-matching based strategies. Given a query image containing the OOI, our approach only requires one to provide label information for a few video frames (e.g., annotating 3 to 6 randomly selected video frames with or without the OOI). With the above query image and selected video frames, we present a novel MIL algorithm with additional constraints on preserving the discriminating ability. This allows us to improve the detection performance over prior MIL-based video instance search approaches. Compared to prior object-matching based methods, our approach utilizes the query image and the input video itself, and there is no need to collect training data for the OOI. As a result, our self-training/detection strategy makes our proposed framework more preferable for practical uses.

As an additional remark, the success of video instance search would also be of great interest to law enforcement units, in which officials often need to identify suspects and crime scenes from different multimedia resources. Take child pornography for example: law enforcement needs to locate the victim, suspect, or crime scene based on all possible information and data received (e.g., photos of the victim or crime scenes, a video shot, etc.). It is obvious that a good computer vision solution to search for (or narrow down) possible video frames given a particular picture of interest would be crucial in the above scenarios. This is why, in addition to embedded marketing, the method proposed in our work is of practical interest and use.

II. RELATED WORK

To search for particular objects of interest throughout a video, one typically applies classifiers or detectors learned on pre-collected training data, or simply performs object matching given the query image of that object. We now briefly review these two types of approaches.

A. Detection-based Approaches

For object detection or localization problems [24], [25], one needs to collect training images of different object categories, with ground-truth information given at pixel or bounding-box levels. Once the learning of the associated detectors or classifiers is complete, one can apply them to detect the objects of interest from test images or video frames. For example, Felzenszwalb et al. [25] introduced an object detection framework based on mixtures of multi-scale deformable part models,

which relies on discriminative training with bounding boxes for each object of interest. For training video frames with only frame-level ground-truth information (i.e., only the presence of the particular object is known, but no location information is available), a weakly supervised setting is required for designing the classifiers. Considering that image regions of the same object would share similar visual features, recent works like [6], [5], [7], [26], [11] proposed to locate the regions of interest through different techniques. For example, Pandey et al. [6] applied deformable part-based models with latent SVMs for discovering the spatial structure of the object of interest across images. Chum and Zisserman [5] updated the region of interest by searching for the most discriminative visual features. Deselaers et al. [7] advanced conditional random fields (CRF) to learn appearance models for describing the object of interest.

Multiple instance learning (MIL) [27] has also been widely applied for solving object localization problems with weakly supervised settings, in which multiple potential regions of interest might be presented in each image/frame [8], [4], [9], [11], [10]. Note that when solving video instance search problems with MIL, the images (or video frames) and the associated regions of interest can be viewed as the “bags” and the corresponding “instances” in MIL. For example, Liu et al. [4] presented a framework for video object summarization. Given 1-3 positive frames containing the target object and 1-3 negative ones, it aimed at discovering the object of interest across video frames. Russakovsky et al. [9] proposed an object-centric spatial pooling approach, which learns the location of the object and pools the associated features for designing the classifiers. Hoai et al. [11] proposed to learn a localization-classification SVM classifier with weakly labeled data for object categorization, but they required the localization of the subwindow in each image for maximizing the SVM scores. For exploring the relationship between images and segmented regions, Li et al. [10] advocated graph-based multi-instance learning by simultaneously involving both image- and region-level information for object-based image retrieval. In our work, we only require image (frame) level information for retrieving the relevant frames containing the OOI.

B. Matching-based Approaches

Object retrieval or recognition methods have been proposed for searching for particular objects throughout videos. To retrieve video frames in which the OOI is presented, one needs to extract representative features from the query image, so that the top matches across video frames can be identified. Image descriptors such as SIFT [28] and SURF [29], or the associated bag-of-words (BoW) [30] models, are among the most popular image representations for performing the above task. To further improve effectiveness and efficiency, techniques of term frequency-inverse document frequency (tf-idf) and inverted indexing [13], [14], [15] can be further deployed. Although the use of local image descriptors or BoW models has been shown to produce promising matching/retrieval results, their disregard of the spatial information of the associated descriptors or visual words would limit the performance, especially if the


Fig. 2. Our proposed framework for performing improved video instance search using a single query image and a few annotated frames (target video and query image → feature representation with window proposals → query-adaptive MIL (q-MIL) training on positive/negative frames → detection). Note that we only require one to annotate a few video frames as positive or negative ones, instead of providing ground-truth label information at the pixel level. The proposed feature representation is described in Section III-A, and details of the q-MIL algorithm are presented in Section III-B.

objects in target images or video frames exhibit rotation or other variations. To encode geometric information of the OOI within an image, many feature representations and similarity metrics have been proposed. For example, Philbin et al. [16] applied SIFT descriptor matching and RANSAC-based spatial re-ranking techniques for refining their search results. Chum et al. [17] introduced a query expansion method, which imposes additional spatial constraints on the top matched images. To preserve spatial information of local image descriptors, Cao et al. [18] proposed to pool them from an image in particular spatial orders. On the other hand, Xie et al. [19] chose to formulate the similarity between two images as a graph matching problem for improved performance. Instead of explicitly dividing an image into different regions for pooling, the co-occurrence of visual words was also utilized to improve the retrieval tasks [20]. Some image copy detection methods also aimed at searching the target images/videos for locating the common OOI, which might be (partially) duplicated from the query [21], [22]. Nevertheless, if the OOI in the query image or in target video frames has noisy descriptors, or the OOI suffers from a variety of visual variations (e.g., shift, scale, rotation, or changes in video resolution), it remains very challenging to perform instance search based on a single query input [23]. In our work, we assume that the query image only contains one OOI. Based on this assumption, a robust video instance search approach which is adaptive to the query input would be expected to produce improved retrieval performance.

III. VIDEO INSTANCE RETRIEVAL

For the scenario of detecting or retrieving particular objects from an input video, we propose a novel weakly supervised learning algorithm based on multiple instance learning (MIL), as illustrated in Figure 2. Given the query image containing only the OOI, our approach only requires the user to randomly annotate one or a few positive or negative frames from the input video, which can be viewed as positive or negative bags when deriving our learning algorithm. Once this derivation process is complete, the labels of the remaining video frames

Fig. 3. Segmentation and BPLR extraction for feature representation. (a) Image segmentation for the query image, followed by extraction of boundary-preserving dense local regions (BPLRs). (b) Video segmentation for the input video, followed by BPLR extraction.

can be predicted accordingly. We first describe the feature representation in Section III-A. Details of our proposed MIL-based algorithm are presented in Section III-B, followed by the process of instance search summarized in Section III-C.

A. Feature Representation

1) Video segmentation: While local image descriptors like SIFT or SURF have been widely applied for representing images, such features are not robust to viewpoint changes, which are widely observed in video data. For video segmentation in our work, we apply the approach of Grundmann et al. [31], which is a scalable segmentation technique using a hierarchical graph-based spatio-temporal segmentation algorithm for videos. Herein, we perform an initial and efficient video segmentation (with both spatial and temporal information considered) on the input video. We advance Boundary Preserving Dense Local Regions (BPLR) [32] for refining and representing the extracted video segments, due to its capability of detecting dense local image regions while preserving context details like object boundaries. Each BPLR segment is described by the corresponding appearance (SIFT), shape (PHOG) [33], texture (LBP) [34], and color information (CIELab). The integration of the above features provides a robust joint feature representation for each BPLR.


Fig. 4. Frame-level feature representation for MIL learning (the top K matched BPLRs in a frame are grouped by sliding windows into candidate windows, i.e., window proposals).

Fig. 5. Examples of different frame/bag labels for our query-adaptive multiple instance learning (q-MIL): the genuine bag (query image), positive bags (positive frames), and negative bags (negative frames).

Fig. 6. Comparisons of (a) the standard MIL and (b) our q-MIL. Note that additional constraints (see Section III-B) are imposed on our q-MIL for successfully classifying the instances in the genuine bag (i.e., the query) as positive ones, so that the classifier learned from the input data is better adapted to the query input.

For segmenting the query input image, we apply the gPb-OWT-UCM method of [35], which performs hierarchical image segmentation while integrating local image contours and region information. Once the segments are extracted from the query image, the same BPLR representation is applied to represent each segment. This segmentation process for both the query and video frames is illustrated in Figure 3. As discussed later in Section III-B, our proposed learning algorithm will automatically identify and weight the features based on the query image, and thus the video frames containing the same OOI (as the query does) will be retrieved accordingly.

2) Frame-level feature representation with window proposals: As discussed earlier, our proposed learning algorithm is based on MIL, and thus each video frame can be viewed as a bag for MIL. Although we have the image segments extracted from each frame, there are generally a large number of segments in each frame, and thus we do not directly take them as MIL instances. In order to identify the candidate windows covering a group of BPLRs as the instances in each bag/frame of the input video, we need the BPLR segments extracted from the query image for performing this task. Since there is only one OOI in the query image, this selection process eliminates redundant BPLRs in each video frame, and thus we can group the remaining candidate ones in bounding boxes (as window proposals) as the instances for performing MIL learning.

To achieve the above goal of deriving the window proposals, let us denote $R_q$ as the set of BPLRs of the query $q$, and $R_i$ as that of the $i$th video frame. We calculate the distance between each BPLR in $R_q$ and that in $R_i$ as

$$\mathrm{dist}(r_q^a, r_i^b) = \left\| f_{r_q^a} - f_{r_i^b} \right\|_2, \qquad (1)$$

in which $r_q^a$ and $r_i^b$ are the $a$th and $b$th BPLRs in $R_q$ and $R_i$, respectively. We have features $f_{r_q^a}$ and $f_{r_i^b}$ as the corresponding representations, which are the normalized concatenations of the different types of features discussed above. From (1), the top $K$ matched BPLRs in the $i$th video frame can be determined. As depicted in Figure 4, we select the candidate windows (i.e., window proposals) by scanning this video frame in a sliding-window fashion. A window proposal is identified if the number of BPLRs within the bounding box is more than half of that of the query input. We note that, from our observations, the choice of $K$ is not crucial, since we obtained comparable retrieval performance

using $K$ = 3-20 in our experiments. To alleviate the computational load, we choose $K$ = 5 in our work. Moreover, in order to deal with possible scale variations of the OOI, we duplicate the window proposals at the same location with different sizes (but with the same aspect ratio as the query image) during the above selection process.

We note that, given a query image input, our proposed framework requires the user to randomly select one or a few positive and negative frames from the video sequence for training purposes. For the positive frames (randomly) chosen by the user, we apply the distance function determined in (1) for identifying window proposals (i.e., instances). Since only frame-level label information is available, our q-MIL will later determine the labels of such window proposals and learn the associated object detector (see Section III-B). As for the negative frames, we randomly select the BPLR segments as the window proposals, which are all negative instances based on the definition of MIL. Finally, for the $j$th window in video frame $i$ (or query $q$), we denote its feature representation as $x_{ij}$ (or $x_{qj}$). Once the window proposals are determined, we apply the same BPLR features as their representations.
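To make the window-proposal step concrete, the following is a minimal NumPy sketch of one reading of the above procedure; the helper names (`top_k_matches`, `propose_windows`), the fixed sliding-window stride, and the exact rule for counting matched BPLRs inside a window are illustrative assumptions rather than the authors' released implementation.

```python
import numpy as np

def top_k_matches(query_feats, frame_feats, k=5):
    """For each query BPLR, find its K nearest frame BPLRs under the L2
    distance of Eq. (1); the union of their indices is treated as 'matched'."""
    dists = np.linalg.norm(query_feats[:, None, :] - frame_feats[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]          # (n_query, k) nearest indices
    return np.unique(nearest)

def propose_windows(bplr_centers, matched_idx, n_query_bplrs,
                    win_sizes, frame_size, stride=16):
    """Slide windows over the frame and keep those containing more than half
    as many matched BPLRs as the query has BPLRs (window-proposal rule above)."""
    H, W = frame_size
    centers = bplr_centers[matched_idx]                  # (row, col) centers of matched BPLRs
    proposals = []
    for (wh, ww) in win_sizes:                           # several sizes, same aspect ratio
        for y in range(0, max(1, H - wh), stride):
            for x in range(0, max(1, W - ww), stride):
                inside = ((centers[:, 0] >= y) & (centers[:, 0] < y + wh) &
                          (centers[:, 1] >= x) & (centers[:, 1] < x + ww))
                if inside.sum() > 0.5 * n_query_bplrs:
                    proposals.append((y, x, wh, ww))
    return proposals
```

Duplicating `win_sizes` at several scales (keeping the query's aspect ratio) mirrors the scale-handling strategy described above.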


B. Query-Adaptive Multiple Instance Learning

We propose a query-adaptive multiple instance learning (q-MIL) algorithm for addressing the video instance retrieval problem. While one can apply existing MIL algorithms like [8], [4], [9] for solving this task (i.e., using the randomly selected positive and negative frames as positive and negative bags for MIL learning), it is not clear how to extend these MIL-based approaches to adaptively take both the query input (with only the OOI presented) and the input video into consideration. In this subsection, we detail our proposed q-MIL algorithm for solving this task, whose superiority over existing matching or MIL-based approaches will be verified later in our experiments.

1) Problem Formulation: Different from video object summarization using MIL, the task of video instance retrieval needs to handle a query input containing the OOI, and the input video whose frames are to be retrieved based on the query. We propose a query-adaptive MIL (q-MIL) for taking both types of information into consideration. Aiming at deriving a novel MIL algorithm which is query specific and adaptive to the input video, our q-MIL is expected to better retrieve video frames which are relevant to the OOI.

To utilize the OOI in the query image, we treat this image as a “genuine bag”, in which all instances (i.e., all windows) are positive, as shown in Figure 5. Thus, in addition to the positive and negative frames/bags from the input video, our q-MIL deals with three different types of bags when discriminating between positive and negative instances. It is worth noting that the genuine bag should be correctly classified, since it corresponds to the OOI and contains only positive instances. Compared to positive and negative frames randomly selected from the input video, the incorporation of this genuine bag information provides our q-MIL with a better discriminating capability when performing classification, as illustrated in Figure 6.

Our proposed approach learns a window detector $C$ for the OOI, which estimates how likely it is that each window in the input video frames contains the OOI. To be more specific, given a query $q$, our q-MIL measures the relevance score $s_{ij}$ of the $j$th window proposal in video frame $i$ by $C$ in terms of a weighted sum of weak learners:

$$s_{ij} = C(x_{ij}) = \sum_t \lambda_t c_t(x_{ij}), \quad c_t(\cdot) \in \{-1, +1\}, \qquad (2)$$

where $x_{ij}$ is the joint feature representation of window $j$ in frame $i$, and $c_t$ indicates the $t$th weak learner (we consider decision stumps [36] as the weak learners in our q-MIL).

We now determine the training set $I$ for this retrieval process. Our training set consists of the query (as the genuine bag), and the positive/negative frames randomly selected from the input video (as positive/negative bags). Thus, we determine the corresponding label $y_i$ of each bag $i$ as

$$y_i = \begin{cases} 0 & \text{for genuine bags} \\ 1 & \text{for positive bags} \\ \tfrac{1}{2} & \text{for negative bags.} \end{cases}$$

It is worth repeating that our proposed approach can be considered a weakly supervised framework, in which only frame-level labels are annotated. Without the label information for the instances (i.e., window proposals) in each frame, we advance the Noisy-OR function [37] for modeling the relationship between instances in the positive and negative bags,

and logistic regression is applied to measure the relationship between the genuine (or positive) bags and the negative ones. As a result, our q-MIL determines the likelihood assigned to the positive/negative frames and the query image as follows:

$$L(C) = L_m(C) + \alpha L_r(C), \qquad (3)$$

where $L_m$ characterizes the Noisy-OR model for the conventional MIL likelihood, and $L_r$ calculates the cost of the logistic regression function for query adaptation, penalized by the weight $\alpha$. Training the proposed q-MIL model amounts to maximizing the above likelihood function, and this can be achieved by applying the optimization technique of AnyBoost [38].

2) Definitions of the likelihood functions: To solve the optimization problem of (3), we first define the Noisy-OR model $L_m$ as follows:

$$L_m(C) = \sum_{i \in I} y_i \cdot \ln\!\left( p_i^{v_i} \cdot (1 - p_i)^{(1 - v_i)} \right), \qquad (4)$$

where $I$ is the training set containing genuine, positive, and negative frames/bags, $p_i$ is the probability of the $i$th bag being positive, and $v_i \in \{0, 1\}$ denotes the label of bag $i$. We have $v_i = 1$ and $0$ for positive and negative bags, respectively. We note that the genuine bag (i.e., the query input) makes no contribution to $L_m$, since the corresponding $y_i = 0$. It can be seen that $L_m(C)$ is maximized when $p_i = v_i$, i.e., when $p_i = 1$ is observed for all positive bags and $p_i = 0$ for all negative bags. From the above formulation, the label of bag $i$ can be calculated as $v_i = 2y_i - 1$, and the probability of bag $i$ being a positive bag is modeled by Noisy-OR:

$$p_i = 1 - \prod_{j \in i} (1 - p_{ij}), \qquad (5)$$

where $p_{ij}$ denotes the probability that an instance $x_{ij}$ is positive (i.e., relevant to the OOI in query $q$). Since individual $p_{ij}$ cannot be validated without instance-level label information, this estimated probability value is propagated to the bag/frame level by the above Noisy-OR model. As a result, $p_{ij}$ can be estimated by the following logistic function:

$$p_{ij} = \frac{1}{1 + e^{-C(x_{ij})}}. \qquad (6)$$

As for $L_r$ in (3), it is determined as

$$L_r(C) = \sum_{i \in I} (y_i - 1) \cdot \left( \sum_{j \in i} \ell_{ij} \right), \qquad (7)$$

where $\ell_{ij}$ denotes the logistic loss of $x_{ij}$. It can be observed that the positive bags make no contribution to this likelihood since $y_i = 1$. Since all the instances in the genuine bag are positive, and those in the negative bags are negative, we consider the logistic loss to measure the misclassification probability as suggested by [39], and the resulting cost function of $C$ on instance $x_{ij}$ is defined as

$$\ell_{ij} = \ln\!\left( 1 + e^{-u_i \cdot C(x_{ij})} \right), \qquad (8)$$

where $u_i \in \{-1, +1\}$ is the instance label (i.e., $u_i = +1$ for instances in the genuine bag, and $u_i = -1$ for those in negative bags), and thus we have $u_i = 1 - 4y_i$.
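As a worked illustration of Eqs. (3)-(8), the sketch below evaluates the combined likelihood for a given set of window scores. It assumes the scores $C(x_{ij})$ have already been computed, and the small epsilon guard is an implementation detail not specified in the paper.

```python
import numpy as np

def q_mil_likelihood(scores, bag_types, alpha=1.0):
    """Evaluate L(C) = L_m(C) + alpha * L_r(C) over one training set.

    scores    : list of 1-D arrays, scores[i][j] = C(x_ij) for window j of bag i
    bag_types : list of strings in {'genuine', 'positive', 'negative'}
    """
    y_of = {'genuine': 0.0, 'positive': 1.0, 'negative': 0.5}
    L_m, L_r, eps = 0.0, 0.0, 1e-12
    for s, kind in zip(scores, bag_types):
        y_i = y_of[kind]
        p_ij = 1.0 / (1.0 + np.exp(-s))                 # Eq. (6): instance probabilities
        p_i = 1.0 - np.prod(1.0 - p_ij)                 # Eq. (5): Noisy-OR bag probability
        if kind != 'genuine':                           # genuine bag has y_i = 0, no L_m term
            v_i = 2.0 * y_i - 1.0                       # bag label in Eq. (4)
            L_m += y_i * (v_i * np.log(p_i + eps) + (1.0 - v_i) * np.log(1.0 - p_i + eps))
        u_i = 1.0 - 4.0 * y_i                           # +1 genuine, -1 negative, unused for positive
        l_ij = np.log1p(np.exp(-u_i * s))               # Eq. (8): logistic loss per instance
        L_r += (y_i - 1.0) * l_ij.sum()                 # Eq. (7); zero for positive bags
    return L_m + alpha * L_r
```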


Algorithm 1: Our Proposed Query-Adaptive MIL
Input: Training set I = {genuine bag (query image); positive/negative bags (frames)}
Output: Window detector C
begin
    t ← 0; C ← 0; ωij ← 1 ∀ i, j;
    while t < T do
        Choose ct ← Solve (9);
        if ∇L(C) · ct < 0 then Break;
        Line search λt ← Solve (11);
        Update C ← C + λt ct;
        Update ω ← By (10);
        t ← t + 1;
    Return C
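The listing below is a compact Python sketch of Algorithm 1 using decision stumps as weak learners: the instance weights follow Eq. (10) and the stump selection follows Eq. (9), while a fixed step size stands in for the line search of Eq. (11) to keep the example short. Function names and the initialization details are assumptions, not the authors' code.

```python
import numpy as np

def instance_weights(s, y_i, alpha):
    """Eq. (10): dL/dC(x_ij) for all instances of one bag with label y_i."""
    p_ij = 1.0 / (1.0 + np.exp(-s))
    p_i = max(1.0 - np.prod(1.0 - p_ij), 1e-12)
    u_i = 1.0 - 4.0 * y_i                              # u_i = 1 - 4*y_i as in Eq. (8)
    l_ij = np.log1p(np.exp(-u_i * s))
    w_mil = ((2.0 * y_i - 1.0) - p_i) * y_i * p_ij / p_i
    w_reg = (np.exp(l_ij) - 1.0) / np.exp(l_ij) * (y_i - 1.0) * u_i
    return w_mil - alpha * w_reg

def fit_stump(X, w):
    """Eq. (9): decision stump c(x) = sgn * sign(x[d] - thr) best aligned with w."""
    best_gain, best = -np.inf, None
    for d in range(X.shape[1]):
        for thr in np.unique(X[:, d]):
            pred = np.where(X[:, d] > thr, 1.0, -1.0)
            for sgn in (1.0, -1.0):
                gain = np.dot(w, sgn * pred)
                if gain > best_gain:
                    best_gain, best = gain, (d, thr, sgn)
    d, thr, sgn = best
    return lambda Z: sgn * np.where(Z[:, d] > thr, 1.0, -1.0)

def train_q_mil(bags, bag_labels, alpha=1.0, T=50, lam=0.1):
    """Algorithm 1 sketch; bags[i] is an (n_i, dim) array of window features,
    bag_labels[i] is y_i (0 genuine, 1 positive, 0.5 negative)."""
    X = np.vstack(bags)
    splits = np.cumsum([len(b) for b in bags])[:-1]
    stumps, coefs = [], []
    for _ in range(T):
        scores = np.zeros(len(X))
        for c, l in zip(stumps, coefs):
            scores += l * c(X)
        w = np.concatenate([instance_weights(s, y, alpha)
                            for s, y in zip(np.split(scores, splits), bag_labels)])
        c_t = fit_stump(X, w)
        if np.dot(w, c_t(X)) < 0:      # c_t no longer points uphill; stop
            break
        stumps.append(c_t)
        coefs.append(lam)              # fixed step in place of the line search of Eq. (11)
    return lambda Z: sum(l * c(Z) for c, l in zip(stumps, coefs))
```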

Fig. 7. Example query images and video frames of the commercial video dataset (Bridgestone, Coke, Doritos, Gillette, Heinz, Lay's, M&M, Oikos, SK-II, and Starbucks).

3) Optimization: We now provide the optimization details for training. To learn the window detector $C$ which maximizes the likelihood $L(C)$, we advance AnyBoost, which applies gradient ascent for solving (3). To be more precise, such optimization techniques allow one to select a new weak learner $c_t$ into the current set $C$ for maximizing the likelihood $L(C + \lambda_t c_t)$. Since $c_t(\cdot) \in \{+1, -1\}$ is chosen from a restricted hypothesis set $H$ of weak learners, it is generally not possible to find a $c_t$ which is exactly equal to $\nabla L(C) = \frac{\partial L(C)}{\partial C}$. Thus, we instead search for the $c_t$ with the largest inner product with $\nabla L(C)$. This corresponds to the first-order Taylor approximation $L(C + \lambda_t c_t) \approx L(C) + \lambda_t \nabla L(C) \cdot c_t$, in which the most significant increase of $L(C)$ occurs when $c_t$ maximizes $\nabla L(C) \cdot c_t$. Since $L(C)$ is a differentiable real-valued function with respect to $C$ and $c_t(\cdot) \in \{+1, -1\}$, solving the above inner product maximization problem is equivalent to solving for the optimal $c_t$ in each iteration $t$:

$$c_t(\cdot) = \arg\max_{c \in H} \sum_{i,j} \omega_{ij} \cdot c(x_{ij}), \qquad (9)$$

where the weight $\omega_{ij}$ on each instance is given by the derivative of the likelihood function with respect to a change in the score of that instance, which is calculated as

$$\omega_{ij} = \frac{\partial L(C)}{\partial C(x_{ij})} = \frac{\partial \left( L_m(C) + \alpha L_r(C) \right)}{\partial C(x_{ij})} = \frac{\big( (2y_i - 1) - p_i \big)\, y_i\, p_{ij}}{p_i} - \alpha\, \frac{e^{\ell_{ij}} - 1}{e^{\ell_{ij}}}\, (y_i - 1)(1 - 4y_i). \qquad (10)$$

With $\nabla L(C) \cdot c_t = \sum_{i,j} \omega_{ij} \cdot c_t(x_{ij})$ and the weights derived, a line search is performed to seek the optimal coefficient $\lambda_t$ for this selected weak learner $c_t$, i.e.,

$$\lambda_t = \arg\max_{\lambda} L(C + \lambda c_t). \qquad (11)$$

As the final step of this optimization, we update $C$ by $C \leftarrow C + \lambda_t c_t$. This iterative optimization process terminates when it exceeds the maximum number of iterations $T$, or when $\nabla L(C) \cdot c_t < 0$ (which indicates that $c_t$ no longer points toward the uphill direction of the likelihood function $L(C)$). The optimization process for training our q-MIL is summarized in Algorithm 1.

C. Object Search in Videos

Given the query and the input video, the above process allows us to derive the window-level detector $C$ for the OOI in a weakly supervised fashion. Once the detector for the OOI is obtained, it can be applied to perform automatic object detection throughout the entire input video sequence. Recall that each video frame is first segmented and represented in terms of BPLRs (as described in Section III-A). As illustrated in Figure 2, the window-level detector classifies the resulting window proposals across video frames (as positive or negative). To be more precise, for window proposal $j$ in frame $i$, the derived detector $C$ estimates how likely it is to contain the OOI and thus produces a relevance score $s_{ij}$. For each frame $i$, the maximum value among all $s_{ij}$ scores indicates the ranking score (i.e., $s_i = \max_{j \in i} s_{ij}$) for this frame when retrieving the OOI. Finally, a ranking list of video frames can be obtained, and video frames containing at least one positive window proposal are considered as positive frames with the OOI presented.
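A minimal sketch of this frame-ranking step is given below; it assumes a `detector` callable (e.g., the output of a q-MIL training routine) that maps window features to relevance scores, and the names used here are illustrative.

```python
import numpy as np

def rank_frames(frame_proposals, detector):
    """Rank frames by s_i = max_j C(x_ij), as described in Section III-C.

    frame_proposals : list of (n_i, dim) arrays of window features per frame
    detector        : callable mapping an (n, dim) array to n relevance scores
    """
    frame_scores = []
    for feats in frame_proposals:
        s_ij = detector(feats) if len(feats) else np.array([-np.inf])
        frame_scores.append(float(s_ij.max()))       # best window decides the frame score
    order = np.argsort(frame_scores)[::-1]           # descending relevance (ranking list)
    positives = [i for i in order if frame_scores[i] > 0]   # frames judged to contain the OOI
    return order, positives
```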


IV. EXPERIMENTS

To evaluate the performance of our proposed method, we consider two datasets in our experiments: a collection of commercial videos, and the TRECVID dataset. We compare our proposed q-MIL algorithm with matching-based or weakly supervised instance search approaches.

TABLE I
PERFORMANCE COMPARISONS (IN TERMS OF MAP) ON THE COMMERCIAL VIDEO DATASET.

|                 | SIFT Matching [28] | SIFT + Spatial Code [22] | BPLR + PHOG [32] | MIL + MSER [4] | Boost + Exemplar [5] | Ours− (without Lr) | Ours− (without Lm) | Ours |
| Query Image     | X | X | X |  |  | X | X | X |
| Training Frames |  |  |  | X (3+, 3−) | X (3+, 3−) | X (3+, 3−) | X (3+, 3−) | X (3+, 3−) |
| Bridgestone     | 0.523 | 0.502 | 0.471 | 0.465 | 0.468 | 0.578 | 0.681 | 0.658 |
| Coke            | 0.392 | 0.363 | 0.535 | 0.455 | 0.334 | 0.645 | 0.621 | 0.688 |
| Doritos         | 0.464 | 0.422 | 0.584 | 0.457 | 0.549 | 0.700 | 0.700 | 0.739 |
| Oikos           | 0.440 | 0.333 | 0.858 | 0.628 | 0.407 | 0.699 | 0.696 | 0.726 |
| M&M             | 0.774 | 0.761 | 0.615 | 0.767 | 0.638 | 0.898 | 0.870 | 0.924 |
| SK-II           | 0.320 | 0.302 | 0.200 | 0.466 | 0.239 | 0.494 | 0.531 | 0.576 |
| Lay's           | 0.425 | 0.251 | 0.746 | 0.539 | 0.195 | 0.735 | 0.825 | 0.872 |
| Gillette        | 0.338 | 0.279 | 0.704 | 0.575 | 0.357 | 0.660 | 0.718 | 0.716 |
| Heinz           | 0.705 | 0.225 | 0.153 | 0.382 | 0.061 | 0.478 | 0.516 | 0.677 |
| Starbucks       | 0.668 | 0.396 | 0.663 | 0.462 | 0.462 | 0.570 | 0.635 | 0.692 |
| MAP             | 0.505 | 0.384 | 0.553 | 0.520 | 0.371 | 0.646 | 0.679 | 0.727 |

Fig. 8. ROC curves of different methods for the commercial video dataset. TPR and FPR denote true positive and false positive rates, respectively.

A. Commercial Video Dataset

1) Settings and Features: Since the task of video instance retrieval is particularly suitable for searching for particular instances embedded in commercial videos, we collect 10 recent commercial videos online, in which 8 of them were Super Bowl commercial videos broadcast in 2011 and 2012, and the other 2 were collected from Asian TV programs. Each commercial video has a commercial product embedded, and the associated product images collected from their official websites are taken as query images. For commercial videos, one can observe significant variations of the embedded products across video frames (see product and video frame examples in Figure 7). We determine the ground-truth positive frames as those in which more than 50% of the area of the product is visible. Then, we manually annotate the video frames which contain the commercial product (i.e., the OOI) as positive frames, and those without the product presented as negative ones. The duration of each commercial video is about 30-60 seconds, and we extract video frames at 4-5 fps for performing the retrieval task. The longest side of each video frame is resized to 300 pixels for normalization purposes, while the aspect ratio is fixed. For the query image,

we resize the longest side to 100 pixels. In our work, we apply the code of [31] to perform video segmentation on the commercial videos, and that of [35] for segmenting the input query image. With each video frame (or the query image) divided into 8×8-pixel grids, the code of [32] is applied to compute the BPLRs as the final local image descriptors, while each BPLR is described by the following features:

• PHOG: We capture shape information with PHOG (pyramid histogram of oriented gradients) descriptors for each BPLR, as suggested in [32]. A total of 3 pyramid levels and 8 orientation bins are considered, and thus a (1 + 2² + 4²) × 8 = 168-dimensional vector is obtained.
• Color: We calculate the color histogram for each BPLR in the CIELab space. In our work, we quantize each channel into 23 bins and derive a 69-dimensional vector for describing the color information of each BPLR. This color space has been observed to provide promising discriminating ability, as suggested by [40].
• LBP: We consider LBP (local binary pattern) features for representing the textural information of each BPLR. We calculate the histograms of uniform rotation-invariant LBP(8,1) features, and a 58-dimensional LBP feature vector for each BPLR is derived.
• SIFT: Finally, we extract dense gray-scale SIFT (scale-invariant feature transform) descriptors for describing the visual appearance information of each BPLR. The SIFT features are extracted from patches of different sizes (4×4, 6×6, 8×8, and 10×10 pixels) [41], and the spacing between adjacent patches is 3 pixels. The extracted SIFT descriptors are encoded by a codebook with 100 visual words (constructed by k-means clustering), resulting in a 100-dimensional vector.

Once all the above features are constructed, we normalize the concatenated features by re-scaling them to zero mean and unit variance as our final BPLR descriptor for performing instance retrieval.
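For reference, a simple sketch of assembling the joint 395-dimensional BPLR descriptor (168 + 69 + 58 + 100) with the zero-mean, unit-variance rescaling described above is shown below; the per-attribute statistics are assumed to be computed over all BPLRs being compared, which is our reading of the normalization step.

```python
import numpy as np

def standardize_bplr_features(phog, lab_hist, lbp_hist, sift_bow):
    """Build joint BPLR descriptors from the four feature blocks.

    Each argument is an (n_bplr, d) array: PHOG (d = (1 + 2**2 + 4**2) * 8 = 168),
    CIELab histogram (d = 69), uniform rotation-invariant LBP (d = 58), and a
    100-word SIFT bag-of-words. Returns an (n_bplr, 395) matrix with each
    attribute rescaled to zero mean and unit variance.
    """
    X = np.hstack([phog, lab_hist, lbp_hist, sift_bow]).astype(float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-12          # guard against constant attributes
    return (X - mu) / sigma
```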


Fig. 9. Example query images and video frames in the TRECVID dataset: 9013 (IKEA), 9015 (black suit), 9016 (zebra line), 9017 (KLM), 9019 (Kappa), 9020 (UMBRO), 9021 (tank), and 9022 (bus).

Fig. 10. ROC curves of the search results in the TRECVID dataset.

Fig. 11. Effects on the retrieval performance (in terms of mean average precision) when increasing the number of negative training frames for our method.

2) Discussions: In our experiments, we compare our proposed query-adaptive MIL with three matching-based image retrieval approaches: SIFT matching [28], spatial coding with SIFT [22], and BPLR matching with PHOG using the nearest neighbor (NN) rule [42]. For weakly supervised methods, we consider the approach of [4] (i.e., MILBoost with MSER (maximally stable extremal regions [43])), and an exemplar-based model with AdaBoost classifiers for learning object classes [5].

We apply the mean average precision (MAP) as the metric for evaluating the retrieval performance of different approaches. We note that, for the weakly supervised approaches in Table I, we randomly select three positive and three negative video frames for learning the associated classifiers or detectors, and we perform their experiments over 5 random trials. Table I lists the MAP scores of different methods (with standard deviations ranging from 0.01 to 0.10). From this table, we see that our proposed q-MIL achieved the best retrieval performance among all. This is because matching-based methods only utilize a query image for matching, and thus are not robust to variations of the OOI observed across video frames. As for the other weakly supervised methods considered, their object classifiers or detectors were simply trained on 3 positive and 3 negative frames, and did not utilize query information as ours does. Our q-MIL advances the query image as the genuine bag, and introduces an additional likelihood function Lr for learning the object detector. To further verify the contribution of this proposed term, we conduct experiments on simplified versions of our approach with the proposed Lr or Lm removed (denoted as “Ours−” in Table I). It can be observed that, while using standard MIL with multiple types of features in terms of BPLR descriptors (i.e., “Ours−”) outperformed prior matching or weakly supervised methods, disregarding the query-adaptive term still limits the resulting performance. For completeness of comparison, we also plot the ROC curves of different approaches in Figure 8, which again confirm the superiority of our proposed method for retrieving relevant video frames given a query image of the OOI.
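For completeness, the MAP metric used in Tables I and II can be computed from a ranked list of frame labels as in the short sketch below; this is a standard definition of average precision, and the function names are ours.

```python
import numpy as np

def average_precision(ranked_labels):
    """Average precision of one ranked frame list (1 = relevant frame, 0 = irrelevant)."""
    r = np.asarray(ranked_labels, dtype=float)
    hits = np.cumsum(r)
    precisions = hits / (np.arange(len(r)) + 1)       # precision at each rank
    n_rel = r.sum()
    return float((precisions * r).sum() / n_rel) if n_rel else 0.0

def mean_average_precision(per_query_ranked_labels):
    """MAP over all queries, as reported in the tables."""
    return float(np.mean([average_precision(r) for r in per_query_ranked_labels]))
```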

B. TRECVID Dataset

1) Descriptions: As a part of the instance search challenge of the TREC Video Retrieval Evaluation since 2010, the TRECVID dataset contains four types of queries: person, character, object, and location. For the task of video instance retrieval in this paper, we consider 8 queries in the object category (and the associated 8 videos¹) with durations varying from 25 to 60 minutes (see examples in Figure 9). It is worth repeating that we only consider the query and its corresponding video when performing the retrieval. We note that, since only shot-level annotations are available for the TRECVID data, we manually annotate the video frames in the shots containing the OOI, and obtain the associated time stamps as ground truth. Different from the commercial video dataset, we choose to uniformly downsample the video frames by 5 frames, and the number of shots per video varies from 250 to 450. The video resolution is 325 × 288 pixels, and we also resize the longest side of each query image to 100 pixels. Similarly, we select the four types of features for describing each extracted BPLR as the representation.

¹ Video IDs: BG 36866 (9013), BG 38429 (9015), BG 37978 (9016), BG 36575 (9017), BG 38421 (9019), BG 38164 (9020), BG 36631 (9021), and BG 36021 (9022).


TABLE II
PERFORMANCE COMPARISONS (IN TERMS OF MAP) ON THE TRECVID DATASET.

|                 | SIFT Matching [28] | SIFT + Spatial Code [22] | BPLR + PHOG [32] | MIL + MSER [4] | Boost + Exemplar [5] | Ours− (without Lr) | Ours |
| Query Image     | X | X | X |  |  | X | X |
| Training Frames |  |  |  | X (3+, 3−) / (3+, 6−) | X (3+, 3−) / (3+, 6−) | X (3+, 3−) / (3+, 6−) | X (3+, 3−) / (3+, 6−) |
| IKEA        | 0.321 | 0.309 | 0.561 | 0.304 / 0.309 | 0.288 / 0.289 | 0.546 / 0.541 | 0.650 / 0.660 |
| Black suit  | 0.286 | 0.282 | 0.184 | 0.323 / 0.425 | 0.211 / 0.211 | 0.487 / 0.564 | 0.611 / 0.628 |
| Zebra line  | 0.066 | 0.130 | 0.005 | 0.086 / 0.104 | 0.006 / 0.005 | 0.300 / 0.560 | 0.573 / 0.657 |
| KLM         | 0.065 | 0.058 | 0.056 | 0.246 / 0.287 | 0.058 / 0.056 | 0.226 / 0.319 | 0.602 / 0.632 |
| Kappa       | 0.038 | 0.036 | 0.021 | 0.106 / 0.135 | 0.043 / 0.054 | 0.101 / 0.179 | 0.157 / 0.215 |
| UMBRO       | 0.020 | 0.019 | 0.020 | 0.043 / 0.030 | 0.015 / 0.014 | 0.071 / 0.083 | 0.193 / 0.190 |
| Tank        | 0.020 | 0.021 | 0.016 | 0.055 / 0.048 | 0.029 / 0.025 | 0.132 / 0.168 | 0.189 / 0.214 |
| Bus         | 0.061 | 0.085 | 0.006 | 0.316 / 0.351 | 0.245 / 0.250 | 0.437 / 0.433 | 0.450 / 0.475 |
| MAP         | 0.110 | 0.118 | 0.109 | 0.185 / 0.211 | 0.112 / 0.113 | 0.287 / 0.356 | 0.428 / 0.459 |

Fig. 12. Examples of the query images and the resulting feature weights observed by our q-MIL (Lay's, Starbucks, UMBRO, and zebra line). (a) Query images, (b) selected frames from the associated input video, and (c) weights determined for each feature attribute from four different types of features: PHOG (in red), color (in blue), LBP (in green), and SIFT (in black). Note that the X-axis is the index of the feature attributes, and the Y-axis indicates the corresponding weights.

2) Discussions: We compare our q-MIL with matching-based and weakly supervised approaches, and list the average MAP scores in Table II (over 5 random trials). Compared with Table I, it can be seen that the MAP scores for the TRECVID dataset are generally lower than those for the commercial video dataset. This is because the OOIs in the commercial video dataset are commercial products, which are generally embedded in the video in a clear way (i.e., with fewer scale, occlusion, and other variations). For the TRECVID data, not only are the videos generally longer, but more visual variations and complex backgrounds can be observed. Nevertheless, compared to the other approaches, our proposed q-MIL outperformed all of them on the TRECVID dataset, which verifies the effectiveness and robustness of our method for retrieving relevant objects across video frames. Figure 10 shows the ROC curves for the different approaches on this dataset, in which the curve of our method was above the others. It is worth noting that the difference between our method and the simplified version (i.e., ours without the query-adaptive term Lr) was much larger than that on the commercial video dataset. This suggests that the use of our introduced query-adaptive likelihood is particularly preferable for challenging videos in practical scenarios.

Finally, we discuss the effects of different numbers of negative training frames for learning. For weakly supervised approaches including ours, one requires a few positive and negative frames randomly selected from the input video for learning the associated classifier (our q-MIL further utilizes the query image). Since the number of negative frames is generally larger than that of positive frames, we further compare the performance using 6 negative frames for the different weakly supervised approaches, and the results are also listed in Table II. It can be seen that having 6 negative frames for learning provided improved MAP performance, which is due to the fact that using more negative training frames reduces the number of possible false alarms.


Fig. 13. Examples of successful instance retrieval results. Note that red boxes denote the windows which have the highest relevance scores in each video frame, and thus are considered as the OOI: (a) Bridgestone, (b) Coke, (c) IKEA, and (d) bus.

Fig. 14. Examples of unsuccessful instance retrieval results. Note that red boxes denote the windows which have the highest relevance scores in each video frame, and thus are considered as the OOI: (a) Bridgestone, (b) Coke, (c) IKEA, and (d) bus.

However, as shown in Figure 11, we observe that increasing the number of negative frames for training (i.e., more than 6) did not remarkably increase the retrieval performance (all experiments conducted over 5 random trials). Thus, we choose to report the results based on 3 positive frames and 6 negative frames in our experiments.

C. Remarks

As noted in Section III-B, we apply decision stumps with AnyBoost for learning the weak learners of our q-MIL to detect the OOI across video frames. Since each weak learner is learned on an individual feature attribute, the resulting weight of each weak learner can be obtained, and thus the contribution of each feature can be visualized. To visualize the feature weights observed for each query image and the associated input video, we plot the weights of each feature attribute for selected examples in Figure 12. From this figure, we see that our q-MIL derived larger weights for several attributes of the PHOG and color features for the OOI of Lay's. As for the trademark of Starbucks, SIFT features were assigned larger weights due to the rich visual appearance of such an OOI. For the OOI of UMBRO, obvious textural patterns can be observed, and this explains why our q-MIL identified the significance of the LBP features during the learning process. Finally, we see that all four types of features contributed to learning the pattern of the zebra crossing. This is because any single type of feature would not easily differentiate such patterns from the background. The above examples demonstrate that our proposed q-MIL is able to identify proper features (by selecting weak learners) for performing video instance retrieval.

Figure 13 shows some examples of successful retrieval results using our method. It can be seen that our proposed q-MIL detected and selected proper window proposals across video frames as the OOI, even when the OOI was presented in cluttered backgrounds, exhibited scale or illumination changes, or was partially occluded. However, for some scenarios as

shown in Figure 14, false positives were produced because the detected windows shared very similar features (shape, texture, or color) with the OOI. Without additional prior knowledge such as the video content, it would be very challenging for detection/classification based approaches to address such particular issues. Nevertheless, from the above experiments on two video datasets, our q-MIL was able to achieve the best retrieval results compared to recent matching or weakly supervised learning based methods.

Finally, we comment on the runtime of our proposed method. The average computation time for training is 13.9 seconds per query (with three positive and three negative frames), and the retrieval (testing) time is 0.5 seconds per frame. Note that all runtime estimates are obtained in Matlab on a personal computer with an Intel Core 2 Duo 2.66 GHz CPU and 4 GB RAM. We did not include the feature extraction time, since this process can typically be performed off-line for all retrieval methods. For matching-based methods (i.e., the first three approaches in Table II), only about 0.07 seconds were required for searching a frame (no training required). The training time for weakly supervised learning based approaches like [4] is about 5 seconds per query (they do not consider query-adaptive terms as ours does). Although their testing time is about one to two orders of magnitude faster than ours, this is due to the fact that their window proposals were either densely or randomly sampled from each frame, and thus they did not consider a query-adaptive selection process when detecting the OOIs. As noted in Tables I and II (and Figures 8 and 10), such methods did not produce MAP or false positive rates as satisfactory as ours.

V. CONCLUSION

In this paper, we presented a novel query-adaptive multiple instance learning (q-MIL) algorithm for video instance retrieval. Given a query image containing the OOI, together with a small number of randomly selected video frames (with only frame-level label information), our method is able to automatically identify the video frames containing the OOI even under visual appearance variations. We first introduced our frame-level feature representation with window proposals for describing the video frames. Inspired by MIL, our method considers each video frame as a bag, and the window proposals in it as the candidate regions of the OOI. Different from the


standard MIL, our q-MIL algorithm adapts the query input to the video content, and solves an MIL-based optimization problem for learning the OOI detector. In particular, we enforce an additional constraint in the proposed q-MIL formulation to observe the relationship between the query and positive frames. This is the reason why the visual appearance variations of the OOI can be properly handled during the learning process. Our experiments on two video datasets verified the effectiveness and robustness of our approach, which was shown to outperform existing video summarization or object matching methods for video instance retrieval.

REFERENCES

[1] "https://www.youtube.com".
[2] X.-S. Hua, T. Mei, and A. Hanjalic, Online Multimedia Advertising: Techniques and Technologies, IGI Global, 2011.
[3] T.-C. Lin, J.-H. Kao, C.-T. Liu, C.-Y. Tsai, and Y.-C. F. Wang, "Video instance search for embedded marketing," in Asia-Pacific Signal & Information Processing Association Annual Summit and Conference, 2012.
[4] D. Liu, G. Hua, and T. Chen, "A hierarchical visual model for video object summarization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 12, pp. 2178-2190, 2010.
[5] O. Chum and A. Zisserman, "An exemplar model for learning object classes," in IEEE International Conference on Computer Vision and Pattern Recognition, 2007.
[6] M. Pandey and S. Lazebnik, "Scene recognition and weakly supervised object localization with deformable part-based models," in IEEE International Conference on Computer Vision, 2011.
[7] T. Deselaers, B. Alexe, and V. Ferrari, "Localizing objects while learning their appearance," in European Conference on Computer Vision, 2010.
[8] M. H. Nguyen, L. Torresani, F. De la Torre, and C. Rother, "Weakly supervised discriminative localization and classification: a joint learning process," in IEEE International Conference on Computer Vision, 2009.
[9] O. Russakovsky, Y. Lin, K. Yu, and L. Fei-Fei, "Object-centric spatial pooling for image classification," in European Conference on Computer Vision, 2012.
[10] F. Li and R. Liu, "Multi-graph multi-instance learning for object-based image and video retrieval," in Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, 2012.
[11] M. Hoai, L. Torresani, F. De la Torre, and C. Rother, "Learning discriminative localization from weakly labeled data," Pattern Recognition, vol. 47, no. 3, pp. 1523-1534, 2014.
[12] C. Schmid, "Weakly supervised learning of visual models and its application to content-based retrieval," International Journal of Computer Vision, vol. 56, no. 1-2, pp. 7-16, 2004.
[13] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in IEEE International Conference on Computer Vision, 2003.
[14] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in IEEE International Conference on Computer Vision and Pattern Recognition, 2006.
[15] J. Sivic and A. Zisserman, "Efficient visual search of videos cast as text retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 591-606, 2009.
[16] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," in IEEE International Conference on Computer Vision and Pattern Recognition, 2007.
[17] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman, "Total recall: Automatic query expansion with a generative feature model for object retrieval," in IEEE International Conference on Computer Vision, 2007.
[18] Y. Cao, C. Wang, Z. Li, L. Zhang, and L. Zhang, "Spatial-bag-of-features," in IEEE International Conference on Computer Vision and Pattern Recognition, 2010.
[19] H. Xie, K. Gao, Y. Zhang, J. Li, and H. Ren, "Common visual pattern discovery via graph matching," in ACM International Conference on Multimedia, 2011.
[20] Y. Zhang, Z. Jia, and T. Chen, "Image retrieval with geometry-preserving visual phrases," in IEEE International Conference on Computer Vision and Pattern Recognition, 2011.
[21] Z. Wu, Q. Ke, M. Isard, and J. Sun, "Bundling features for large scale partial-duplicate web image search," in IEEE International Conference on Computer Vision and Pattern Recognition, 2009.
[22] W. Zhou, Y. Lu, H. Li, Y. Song, and Q. Tian, "Spatial coding for large scale partial-duplicate web image search," in ACM International Conference on Multimedia, 2010.
[23] R. Arandjelović and A. Zisserman, "Three things everyone should know to improve object retrieval," in IEEE International Conference on Computer Vision and Pattern Recognition, 2012.
[24] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 509-522, 2002.
[25] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, 2010.
[26] C. Wang, L. Zhang, and H.-J. Zhang, "Graph-based multiple-instance learning for object-based image retrieval," in Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, 2008.
[27] O. Maron and T. Lozano-Perez, "A framework for multiple instance learning," in Advances in Neural Information Processing Systems, 1998.
[28] D. G. Lowe, "Object recognition from local scale-invariant features," in IEEE International Conference on Computer Vision, 1999.
[29] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, 2008.
[30] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in Workshop on Statistical Learning in Computer Vision, European Conference on Computer Vision, 2004.
[31] M. Grundmann, V. Kwatra, M. Han, and I. Essa, "Efficient hierarchical graph-based video segmentation," in IEEE International Conference on Computer Vision and Pattern Recognition, 2010.
[32] J. Kim and K. Grauman, "Boundary preserving dense local regions," in IEEE International Conference on Computer Vision and Pattern Recognition, 2011.
[33] A. Bosch, A. Zisserman, and X. Munoz, "Representing shape with a spatial pyramid kernel," in IEEE International Conference on Computer Vision and Pattern Recognition, 2007.
[34] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, 2002.
[35] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, "From contours to regions: An empirical evaluation," in IEEE International Conference on Computer Vision and Pattern Recognition, 2009.
[36] W. Iba and P. Langley, "Induction of one-level decision trees," in International Conference on Machine Learning, 1992.
[37] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers Inc., 1988.
[38] L. Mason, J. Baxter, P. Bartlett, and M. Frean, "Boosting algorithms as gradient descent," in Advances in Neural Information Processing Systems, 2000.
[39] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: a statistical view of boosting," Annals of Statistics, vol. 28, no. 2, pp. 337-407, 2000.
[40] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, and X. Huang, "Global contrast based salient region detection," in IEEE International Conference on Computer Vision and Pattern Recognition, 2009.
[41] A. Vedaldi and B. Fulkerson, "VLFeat: An open and portable library of computer vision algorithms," http://www.vlfeat.org/, 2008.
[42] O. Boiman, E. Shechtman, and M. Irani, "In defense of nearest-neighbor based image classification," in IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[43] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide baseline stereo from maximally stable extremal regions," in British Machine Vision Conference, 2002.
