
An Efficient Bayesian Approach to Exploit the Context of Object-Action Interaction for Object Recognition

Sungbaek Yoon 1, Hyunjin Park 1 and Juneho Yi 1,2,*

1 School of Electronic and Electrical Engineering, Sungkyunkwan University, Suwon 16419, Korea; [email protected] (S.Y.); [email protected] (H.P.)
2 School of Information and Communication Engineering, North University of China, Taiyuan 03000, China; [email protected]
* Correspondence: [email protected]; Tel.: +82-031-290-7142

Academic Editor: Vittorio M.N. Passaro
Received: 8 March 2016; Accepted: 23 June 2016; Published: 25 June 2016

Abstract: This research features object recognition that exploits the context of object-action interaction to enhance the recognition performance. Since objects have specific usages, and human actions corresponding to these usages can be associated with these objects, human actions can provide effective information for object recognition. When objects from different categories have similar appearances, the human action associated with each object can be very effective in resolving ambiguities related to recognizing these objects. We propose an efficient method that integrates human interaction with objects into a form of object recognition. We represent human actions by concatenating poselet vectors computed from key frames and learn the probabilities of objects and actions using random forest and multi-class AdaBoost algorithms. Our experimental results show that poselet representation of human actions is quite effective in integrating human action information into object recognition.

Keywords: object recognition; object-action context; object-human interaction

1. Introduction

Object recognition is difficult due to a variety of factors, including viewpoint variation, illumination changes, and occlusion. Beyond these factors, however, the inherent difficulty of object recognition lies in the fact that there is a large amount of intra-category appearance variation, while objects from different categories may have similar appearances. In order to improve the performance of object recognition, researchers have exploited contextual information that includes spatial [1–3], semantic [4–7], and scale [8,9] contexts. Spatial context refers to information about the potential locations of objects in images or the positional relationship between objects. Semantic context provides clues related to the co-occurrence of objects with other objects in a scene. Scale context gives the relative scale of objects in a scene.

In this work, we focus on the context of object-action interaction, which has been relatively unexplored. Since objects have specific usages, and human actions corresponding to these usages can be related to these objects, it is possible to improve the performance of object recognition by exploiting human interactions with objects as a type of context information. In particular, when objects from different categories have similar appearances, analyzing the human action associated with each object can be effective in resolving the ambiguity related to recognizing objects. As illustrated in Figure 1, when a cup or spray bottle is held by a human hand, they look very similar because of their cylindrical structures. In this case, exploiting the context of the object-action interaction greatly facilitates the distinction between the two objects.



Figure 1. Examples of object-action context. Objects have specific usages and human actions corresponding to these usages can be related to these objects.

There have been a few experiments that have adopted similar ideas with different representations of human actions, objects, and computational algorithms. Moore et al. [10] depicted human actions using the hidden Markov model (HMM) by tracking hand locations, although it is not easy to normalize different action speeds for different individuals. Gupta et al. [11] recognized human-object interactions based on the integration of action recognition and object recognition, where human actions and objects contribute mutual contexts for each other. They also represented human actions using HMM by detecting hand trajectories. Human actions were segmented into several atomic actions; however, stable segmentation of each action into atomic actions is difficult. Yao et al. [12] modeled the context between human poses and objects using Markov random field modeling to recognize human-object interactions. Their work is based on a single pose, and it is not clear which pose belongs to which action. Thus, categories of objects may be relatively obscured compared to when action information is employed. Grabner et al. [13] described the relations between objects and a human pose based on matching their shapes. They exploited these relations to detect an affordance, which is the functionality implied by an object, rather than to recognize a specific object category.

Alternatively, deep learning approaches, such as convolutional neural networks (CNN) [14], have achieved great success in object recognition. However, as can be seen in our experimental results, when there are not enough labelled images available, the recognition performance is not as high as expected. In addition, it is difficult to find an optimal CNN architecture for a given problem.

The goal of this study is to efficiently and effectively incorporate human action information into object recognition in order to boost the recognition performance. We employ a few image frames that contain key poses, which can be used to distinguish human actions. Since an assemblage of key poses can take advantage of the fiducial appearance of the human body in action, representation of human actions by concatenating a few key poses is quite effective. The main contribution of this work is the establishment of an effective Bayesian approach that exploits the probabilities of objects and actions through random forest and multi-class AdaBoost algorithms.

Figure 2 overviews our method, which recognizes objects using object-action context. First, random forests for objects and actions are trained independently using object features obtained from object images and action features acquired from video sequences. Additionally, by regarding each tree in a random forest as a weak classifier, the weight of the tree is determined using multi-class AdaBoost [15]. The object categories of the input data are determined by applying a Bayesian approach using the probabilities calculated from object features and action features. We represent human actions by concatenating poselet [16,17] vectors computed from key frames in a video. Poselets, depicting local parts of human poses, are feature vectors that are strictly clustered based on their appearance. The value of an element in a poselet vector is the maximum response value of the key frame to each poselet; we use a support vector machine (SVM) as the poselet classifier.


Recently, with the resurgence of neural networks, poselets have a new version based on the neural network [18]. However, the neural network-based approach is more computationally expensive than our random forest-based method and also requires many more training images to produce a well-trained network. We use the histogram of oriented gradients (HOG) to represent objects. The experimental results show that our method, using object-action context, enhances the performance of object recognition when the appearances of objects belonging to the same category largely differ and objects of different categories are similar in appearance.

Figure 2. Object recognition using object-action context.

This paper is organized as follows. The following section presents the probabilistic model we propose for object recognition and describes our approach for determining the probabilities of objects and actions using random forest and multi-class AdaBoost algorithms. The methods used for representing objects and actions are given in Section 3, and our experimental results are reported in Section 4.

2. Incorporating Object-Action Context into Object Recognition

Let O and A denote object categories and human action categories, respectively. x^O is an appearance feature from an object, and x^A is a feature of a human action related to the object. Given x^O and x^A, the probability of the object category, p(O | x^O, x^A), can be computed by Equation (1):

p(O | x^O, x^A) = p(O | x^O) Σ_A p(O, A | x^A) = p(O | x^O) Σ_A p(O | A) p(A | x^A)    (1)

Our method outputs the object that maximizes p(O | x^O, x^A) as the recognition result. The goal of this method is to efficiently learn the probability of the object category p(O | x^O) given an object feature x^O, the probability of the action category p(A | x^A) given an action feature x^A, and p(O | A).
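To make the fusion rule concrete, the following minimal sketch (ours, not the authors' code; the names and toy numbers are illustrative) evaluates Equation (1) given the three learned distributions, with p(O | A) taken, for example, from Equations (3) and (4) below:

```python
import numpy as np

def fuse_object_action(p_obj_given_xO, p_act_given_xA, p_obj_given_act):
    """Equation (1): p(O | x^O, x^A) = p(O | x^O) * sum_A p(O | A) p(A | x^A).

    p_obj_given_xO : (n_objects,)            p(O | x^O) from the object forest
    p_act_given_xA : (n_actions,)            p(A | x^A) from the action forest
    p_obj_given_act: (n_objects, n_actions)  p(O | A), e.g. from Equations (3)-(4)
    """
    context = p_obj_given_act @ p_act_given_xA   # marginalize over actions A
    scores = p_obj_given_xO * context
    return int(np.argmax(scores)), scores        # recognized object category

# Toy example with 4 objects (cup, phone, scissors, spray) and 4 actions.
p_O = np.array([0.4, 0.1, 0.1, 0.4])             # appearance: cup vs. spray is ambiguous
p_A = np.array([0.1, 0.05, 0.05, 0.8])           # "spraying" is the likely action
p_O_given_A = np.eye(4)                          # one correct action per object
best, scores = fuse_object_action(p_O, p_A, p_O_given_A)   # -> index 3 (spray bottle)
```

In this toy case the object appearance alone cannot separate the cup from the spray bottle, but the likely "spraying" action tips the decision, which is the ambiguity-resolution behavior described above.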

We first describe how to estimate p(O | x^O) and p(A | x^A). We employ a random forest to learn p(O | x^O) and p(A | x^A). Figure 3 depicts the process used to calculate the probabilities of the object categories. The probability of object category j, P_j, is a weighted summation of the probabilities of the object categories, P_{θ_i}(j), i = 1, ..., n, which are obtained from the trees in the random forest. The weights of the trees, α_{θ_i^O}, i = 1, ..., n, are trained by multi-class AdaBoost.

Figure 3. Computing the probabilities of object categories using a random forest and multi-class AdaBoost.

The training process for the probability of an object category is as follows. First, given training data, D = {(x_1^o, o_1), ..., (x_n^o, o_n)}, where o_j = {x_1^{o_j}, ..., x_M^{o_j}}, a random forest of k trees, θ_1, ..., θ_k, is generated from the data. x_i^{o_j} represents the i-th object feature belonging to object category o_j. The probability of o_j that is computed by the random forest is given in Equation (2):

p(o_j | x^O) ≡ p(o_j | x^O, Θ^O) = (1 / |Θ^O|) Σ_{i=1..k} α_{θ_i^O} · n_{o_j, θ_i^O} / n_{θ_i^O}    (2)

Here, Θ^O is the random forest built from the object features, θ_i^O ∈ Θ^O denotes the i-th decision tree, and |Θ^O| = k. n_{o_j, θ_i^O} represents the amount of training data in tree θ_i^O that is classified as object category o_j, and n_{θ_i^O} is the total amount of training data in tree θ_i^O at the leaf node. By treating each tree in the random forest as a weak classifier, the weight of each tree, α_{θ_i^O}, is learned using multi-class AdaBoost [15]. p(A | x^A) is determined in exactly the same way by using action features.

For splitting nodes in the trees of the random forests, two parameters, 'MinParentSize' and 'MinLeafSize', are defined. 'MinParentSize' and 'MinLeafSize' denote the minimum number of samples in a node and in a leaf node, respectively. We have set 'MinParentSize' to 20 and 'MinLeafSize' to 10. A tree stops splitting when any of the following conditions hold: (1) a node contains only samples of one class; (2) the number of samples in a node is fewer than 'MinParentSize'; or (3) any split applied to the node would generate children with fewer than 'MinLeafSize' samples.

Figure 4 describes the learning process of multi-class AdaBoost, which is an extension of the binary AdaBoost learning process to multiple classes. It generates classification rules and readjusts the distribution of the training data using the preceding classification rules. When the amount of training data is n and C is the number of categories, the initial distribution of the data is computed in the first step. During k repetitions, w is updated and data that are not well classified are assigned higher values. In the second step, the error of the weak classifier, T^(m)(x), is computed and w is renewed based on the error. Lastly, we acquire the probability of an object category as a linear combination of the probabilities obtained from the trees that are weak classifiers; this is done using the weights α. In Step 2c, the extra term, log(C − 1), represents the only variation from the binary AdaBoost algorithm. Unlike in binary classification, where the error rate of random guessing is 1/2, the error rate of random guessing is (C − 1)/C for multi-class classification, so the AdaBoost assumption that the error rate of the weak classifier is less than 1/2 is no longer satisfied. Thus, in order to address this drawback of AdaBoost, the log(C − 1) term is added.
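As a hedged illustration rather than the authors' implementation, Equation (2) can be sketched with scikit-learn, where each tree's leaf class fractions play the role of n_{o_j,θ_i^O}/n_{θ_i^O}, the node-splitting parameters mirror 'MinParentSize' and 'MinLeafSize', and the tree weights α are assumed to come from the multi-class AdaBoost procedure of Figure 4:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def forest_class_probabilities(forest, alphas, x):
    """Equation (2): weighted average of per-tree leaf class fractions."""
    x = np.asarray(x).reshape(1, -1)
    k = len(forest.estimators_)
    probs = np.zeros(forest.n_classes_)
    for alpha, tree in zip(alphas, forest.estimators_):
        # predict_proba of a single tree returns the fraction of training
        # samples of each class in the leaf node that x falls into.
        probs += alpha * tree.predict_proba(x)[0]
    return probs / k

# Toy usage: 114-D action features (3 key frames x 38 poselets), 4 categories.
X = np.random.rand(200, 114)
y = np.random.randint(0, 4, size=200)
rf = RandomForestClassifier(n_estimators=100, min_samples_split=20,
                            min_samples_leaf=10).fit(X, y)
alphas = np.ones(rf.n_estimators)      # placeholder; learned by AdaBoost below
p_action = forest_class_probabilities(rf, alphas, X[0])
```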

Figure 4. Training the weight of each tree in the random forest using multi-class AdaBoost.
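For reference, the weight-training loop that Figure 4 summarizes follows the multi-class AdaBoost (SAMME) scheme of Zhu et al. [15]; the sketch below is our reading of it, with the already-fitted trees treated as fixed weak classifiers rather than being refit each round:

```python
import numpy as np

def train_tree_weights(trees, X, y, n_classes):
    """Learn a weight alpha for each weak classifier (tree) with multi-class
    AdaBoost [15]; badly classified samples get boosted weights each round."""
    n = len(y)
    w = np.full(n, 1.0 / n)               # step 1: initial data distribution
    alphas = []
    for tree in trees:                    # one round per weak classifier
        miss = (tree.predict(X) != y)     # samples this tree gets wrong
        err = np.clip(w @ miss / w.sum(), 1e-10, 1 - 1e-10)
        # Step 2c: the extra log(C - 1) term is the only change from binary
        # AdaBoost; random guessing now has error (C - 1)/C instead of 1/2.
        alpha = np.log((1.0 - err) / err) + np.log(n_classes - 1)
        alphas.append(alpha)
        w = w * np.exp(alpha * miss)      # re-weight the training data
        w = w / w.sum()
    return np.array(alphas)
```

With alphas = train_tree_weights(rf.estimators_, X_boost, y_boost, 4), the placeholder weights in the previous sketch can be replaced by learned ones (X_boost and y_boost denote the separate boosting set mentioned in Section 4).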

To estimate p(O | A), we use p(A | O) and the Bayes rule:

p(O | A) = p(A | O) p(O) / Σ_O p(A | O) p(O)    (3)

where p(A | O) can be calculated based on the number of observations associated with the same object category:

p(A = a_j | O = o_i) = n_j / N_i    (4)

Here, N_i is the number of observations associated with object category o_i, and n_j represents the number of observations for action category a_j. In our experiments, training image sequences are collected such that each subject takes the action that corresponds to the correct usage of a given object. Thus, in the actual implementation, p(A = a_j | O = o_i) = 1 for i = j and 0 otherwise. Here, i = j means an object and its correct action pair.
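A small sketch of this counting estimate (Equations (3) and (4)); the variable names are ours, and in the paper's setup the result reduces to an identity-like table because each object is observed only with its correct action:

```python
import numpy as np

def estimate_p_obj_given_act(actions_per_object, p_obj, n_actions):
    """Equations (3) and (4): estimate p(O | A) from action/object co-occurrences.

    actions_per_object : one integer array per object category o_i, holding the
                         action labels observed together with that object.
    p_obj              : prior p(O), shape (n_objects,).
    """
    n_objects = len(actions_per_object)
    # Equation (4): p(A = a_j | O = o_i) = n_j / N_i
    p_act_given_obj = np.zeros((n_actions, n_objects))
    for i, actions in enumerate(actions_per_object):
        counts = np.bincount(actions, minlength=n_actions)
        p_act_given_obj[:, i] = counts / counts.sum()
    # Equation (3): p(O | A) = p(A | O) p(O) / sum_O p(A | O) p(O)
    joint = p_act_given_obj * p_obj[np.newaxis, :]
    p_obj_given_act = joint / joint.sum(axis=1, keepdims=True)
    return p_obj_given_act.T            # rows: objects, columns: actions
```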

3. Representing Objects and Human Actions

We can regard human actions as an assemblage of continuous poses. However, on account of the similarity between the poses in adjacent frames, singling poses out from all of the video frames creates needless duplication. Thus, we extracted the key frames from the video in order to use the minimum number of poses to express human actions. We then deployed poselet vectors to represent the key frames.

Figure 5 shows the procedure used for turning key frames into poselet vectors. To describe the key frames using poselet vectors, the labeled poselets shown in Figure 6 are expressed by HOG [19], and an SVM is learned for each poselet. A poselet vector is generated using the maximum response values, which are obtained by applying all of the learned poselet SVMs to a key frame through a sliding window technique. An action feature is then obtained by concatenation of the poselet vectors.

Figure 5. Creation of poselet vectors and action features.

Figure 6. Examples of poselets.
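The sketch below is one possible reading of the poselet-vector construction and the key-frame matching described in this section; the HOG parameters, window stride, single-scale search, and SVM interface are illustrative assumptions rather than the authors' exact settings:

```python
import numpy as np
from skimage.feature import hog

def poselet_vector(frame, poselet_svms, window=(96, 94), step=16):
    """One element per poselet: the maximum linear-SVM response obtained by
    sliding a window over the (grayscale) frame and scoring its HOG feature."""
    h, w = window
    responses = np.full(len(poselet_svms), -np.inf)
    for y in range(0, frame.shape[0] - h + 1, step):
        for x in range(0, frame.shape[1] - w + 1, step):
            feat = hog(frame[y:y + h, x:x + w], orientations=9,
                       pixels_per_cell=(8, 8), cells_per_block=(2, 2))
            for i, svm in enumerate(poselet_svms):
                responses[i] = max(responses[i],
                                   svm.decision_function(feat.reshape(1, -1))[0])
    return responses

def action_feature(key_frames, poselet_svms):
    """Concatenate the key-frame poselet vectors (3 frames x 38 poselets = 114-D)."""
    return np.concatenate([poselet_vector(f, poselet_svms) for f in key_frames])

def select_key_frames(frames, training_key_vectors, poselet_svms):
    """Pick, for each training key frame, the input frame whose poselet vector is
    closest in Euclidean distance (the temporal constraint mentioned in the text
    is omitted here for brevity)."""
    vecs = np.array([poselet_vector(f, poselet_svms) for f in frames])
    return [frames[int(np.argmin(np.linalg.norm(vecs - kv, axis=1)))]
            for kv in training_key_vectors]
```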

To extract key frames from an input video, a poselet vector is computed for each frame of the input video, and the Euclidean distance between the frames (at a similar time as the training key frames) and the training key frames is computed using their poselet vectors. The frames with the minimum distance are selected as the key frames of the input video. Objects are represented using HOG. The size of an object image is 50 × 50.

4. Experimental Results

We compared the performance of our method with that of the one proposed by Gupta et al. [11] and a CNN. To our knowledge, the algorithm of Gupta et al. [11] is the most representative work that exploits human actions as context information for object recognition. We have included a CNN for performance comparison because CNNs have recently achieved great success in object recognition. We have also conducted an experiment using local space-time action features. To represent actions, we have used a Bag of Visual words (BoV) model of local N-jets [20–22], which are built from space-time interest points (STIP) [21,22].


Local N-jets are popular and powerful motion features, and their first two levels capture velocity and acceleration. The code book for the BoV is constructed using a K-means algorithm.

For our experiments, we designed a CNN architecture by referring to the CIFAR-10 demo [23]. As described in Table 1, the network contains 13 layers. The outputs of the first, second, and third convolutional layers are conveyed to rectified linear unit (ReLU) and pooling layers. The first pooling layer is a max pooling layer and the remaining pooling layers are average pooling layers. The fourth convolutional layer and the two fully-connected layers are linked to one another without intervening ReLU and pooling layers. The last fully-connected layer feeds its output to a softmax.

Table 1. The CNN architecture used for the experiments.

Layer     Operation         Input Size      Filter Size        Pool     Stride   Output Size
Layer 1   Conv              50 × 50 × 3     5 × 5 × 3 × 32     -        1        50 × 50 × 32
Layer 2   Max pooling       50 × 50 × 32    -                  3 × 3    2        25 × 25 × 32
Layer 3   ReLU              25 × 25 × 32    -                  -        -        25 × 25 × 32
Layer 4   Conv              25 × 25 × 32    5 × 5 × 32 × 32    -        1        25 × 25 × 32
Layer 5   ReLU              25 × 25 × 32    -                  -        -        25 × 25 × 32
Layer 6   Avg pooling       25 × 25 × 32    -                  3 × 3    2        12 × 12 × 32
Layer 7   Conv              12 × 12 × 32    5 × 5 × 32 × 64    -        1        12 × 12 × 64
Layer 8   ReLU              12 × 12 × 64    -                  -        -        12 × 12 × 64
Layer 9   Avg pooling       12 × 12 × 64    -                  3 × 3    2        6 × 6 × 64
Layer 10  Conv              6 × 6 × 64      4 × 4 × 64 × 64    -        1        3 × 3 × 64
Layer 11  Fully-connected   3 × 3 × 64      3 × 3 × 64 × 64    -        1        1 × 1 × 64
Layer 12  Fully-connected   1 × 1 × 64      1 × 1 × 64 × 4     -        1        1 × 1 × 4
Layer 13  Softmax           1 × 1 × 4       -                  -        -        1 × 1 × 4
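As a hedged sketch only, the network in Table 1 can be approximated in PyTorch as below; the padding values are our assumptions, chosen so that the intermediate sizes match the table, and the two fully-connected layers are written as convolutions that consume the full spatial extent:

```python
import torch
import torch.nn as nn

# Approximate reimplementation of Table 1 (input: 3 x 50 x 50, 4 object classes).
cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, padding=2),     # Layer 1:  50 x 50 x 32
    nn.MaxPool2d(3, stride=2, padding=1),           # Layer 2:  25 x 25 x 32
    nn.ReLU(),                                      # Layer 3
    nn.Conv2d(32, 32, kernel_size=5, padding=2),    # Layer 4:  25 x 25 x 32
    nn.ReLU(),                                      # Layer 5
    nn.AvgPool2d(3, stride=2),                      # Layer 6:  12 x 12 x 32
    nn.Conv2d(32, 64, kernel_size=5, padding=2),    # Layer 7:  12 x 12 x 64
    nn.ReLU(),                                      # Layer 8
    nn.AvgPool2d(3, stride=2, padding=1),           # Layer 9:  6 x 6 x 64
    nn.Conv2d(64, 64, kernel_size=4),               # Layer 10: 3 x 3 x 64
    nn.Conv2d(64, 64, kernel_size=3),               # Layer 11 (fully-connected): 1 x 1 x 64
    nn.Conv2d(64, 4, kernel_size=1),                # Layer 12 (fully-connected): 1 x 1 x 4
    nn.Flatten(),
    nn.Softmax(dim=1),                              # Layer 13
)

probs = cnn(torch.randn(1, 3, 50, 50))              # -> shape (1, 4)
```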

For the experiments, we captured videos of 19 subjects performing four kinds of actions with four different objects (i.e., cups, scissors, phones, and spray bottles). Each of the subjects carried out actions using these objects. We constructed a dataset that contains 228 video sequences [24]. We extracted key frames from the video sequences in order to use the minimum number of poses to express human actions and deployed poselet vectors to represent the key frames. An action feature is represented as a concatenation of three poselet vectors. We used 38 kinds of poselets in this experiment; thus, an action feature has 114 dimensions. To learn the poselet SVMs, we used 20,308 positive images for 38 different poses and 2321 negative images. The size of the poselet training images is 96 × 94. A linear SVM is used to differentiate samples in a single poselet category from samples belonging to all of the remaining poselet categories. In order to obtain more positive action data for the random forest and SVM, we used combinations of the frames adjacent to the key frames. As a result, to train the action random forest, we used the following amounts of action features: 1625 action features for the "drinking water" action, 7149 for "calling on the phone", 1674 for "cutting paper", and 678 for "spraying". For training the multi-class AdaBoost, we used 848 action features for "drinking water", 1890 for "calling on the phone", 330 for "cutting paper", and 658 for "spraying".

The object images used in the experiments were obtained from Google Image Search [25] and ImageNet [26]. We collected 3120 cup images, 4131 phone images, 2263 scissors images, and 2006 spray bottle images. We used 1200 images from each category to train the object random forest and 600 images for training the multi-class AdaBoost. Figure 7 shows some of the object images that were used in our experiments. The object image set contains objects that have a variety of appearances within the same category. Some objects, such as cups and spray bottles, are similar in appearance due to their cylindrical structure; however, these objects belong to different categories.


Figure 7. Some of the object images used in our experiments.

We conducted experiments with random forests using 100, 150, 200, 250, and 300 trees. Figure 8 shows the confusion matrices, which describe the results of object recognition. The first column represents the results of object recognition using only object appearance features and the second column depicts the results of object recognition using both object appearances and human actions. As expected, we see improved object recognition when using the human actions. Overall, the recognition rate is improved by between 4% (scissors) and 30% (phone), as compared to when only object appearances are used. The number of trees has little influence on the performance of object recognition in the experiments, both with and without human action context.

Figure 8. The results of object recognition: the left column shows the results using only the object's appearance and the right column represents the results using the object's appearance and human actions.


Figure 9 shows the result of object recognition in which actions are represented by the BoV of N-jets. For training the action random forest, we used 39 action features for the "drinking water" action, 39 for "calling on the phone", 39 for "cutting paper", and 40 for "spraying". For training the multi-class AdaBoost, the action features employed in the random forest are also utilized. For testing, we used 18 action features for "drinking water", 18 for "calling on the phone", 18 for "cutting paper", and 17 for "spraying". We used 1200 images from each object category to train the object random forest and 600 images for training the multi-class AdaBoost. For testing, we used 18 object features for "cup", 18 for "phone", 18 for "scissors", and 17 for "spray".

Figure 9. The results of object recognition using the BoV of local N-jets: the first column shows the results using only the object's appearance and the second column represents the results using the object's appearance and human actions.

Except for spray bottles, we have observed that the performance of object recognition is also significantly improved when using the BoV of local N-jets as action features. The improvement of the recognition rate achieved ranges from 50% (phone) to 6% (cup). As described in Figures 8 and 9, the poselet representations of the actions show better performances when recognizing cups, phones, and spray bottles. The differences of recognition rates between the action features were 6%–22% for cups, 1%–4% for phones, and 3%–16% for spray bottles. On the other hand, the recognition of scissors is improved from 3% to 12% using the BoV of local N-jets.


Figure 10 shows the results of applying Gupta's algorithm to our experimental data. With the exception of cups, objects exhibit low recognition performance compared with our method. The differences of recognition rates between our method and their method were 28%–31% for telephones, 21%–31% for scissors, and 7%–9% for spray bottles. We observed that this performance difference is caused mainly by their representation of human actions with incorrectly segmented atomic actions. From the experimental results, we see that our poselet representation of human actions, using a simple graphical model, is more effective at integrating human action information into object recognition.

Figure 10. The results of Gupta's algorithm for our experimental data.

The results of applying the CNN to our experimental data are shown in Figure 11. To train the CNN, we used the same number of images for each category as was used in our method (1800). It can be seen that our method outperforms the CNN. The performance improvements over the CNN were 11% for cups, 34%–37% for telephones, 3%–10% for scissors, and 4%–6% for spray bottles. To allow for a clearer performance comparison, we also included Figure 12. We observed that 1800 labeled images for each category are not enough to adequately train the CNN and guarantee better performance than what was obtained by our method. Moreover, it is difficult to find the optimal CNN architecture for the given problem.

Figure 11. The results of applying the CNN to our experimental data.

Figure 12. The performance comparison of our methods using poselet vectors and local N-jets with Gupta's algorithm and the CNN.


Cups and spray bottles look similar to each other, especially when they are held in a human hand, because of their cylindrical structure. Even some phones, such as cordless home phones, have appearances that are similar to cups and spray bottles in the feature space (due to their rectangular form). From the experimental results, we confirmed that our method greatly facilitates distinction between similar looking objects from different categories by efficiently exploiting the action information associated with the objects.

5. Conclusions

This work focused on the efficient use of object-action context to resolve the inherent difficulty of object recognition caused by large intra-category appearance variations and inter-category appearance similarities. To accomplish this, we proposed a method that integrates how humans interact with objects into object recognition. The probabilities of objects and actions have been computed effectively using random forest and multi-class AdaBoost algorithms. Through experiments, we confirmed that a few key poses provide sufficient information for distinguishing human actions. When objects from different categories have similar appearances, the use of the human actions associated with each object can be effective in resolving ambiguities related to recognizing these objects. We also observed that when we have an insufficient amount of labelled objects, which inhibits recognition, carefully-designed statistical learning methods using handcrafted features are more adequate for obtaining an efficient solution, as compared to deep learning methods.

Acknowledgments: This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2013R1A1A2006164).

Author Contributions: Sungbaek Yoon conceived and designed the experiment; Sungbaek Yoon and Hyunjin Park performed the experiments; and Sungbaek Yoon and Juneho Yi wrote the paper.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

HMM    Hidden Markov Model
CNN    Convolutional Neural Network
SVM    Support Vector Machine
HOG    Histogram of Oriented Gradients
k-NN   k Nearest Neighbors
ReLU   Rectified Linear Unit
BoV    Bag of Visual words
STIP   Space Time Interest Point

References

1. Kumar, S.; Herbert, M. A hierarchical field framework for unified context-based classification. In Proceedings of the IEEE International Conference on Computer Vision, Beijing, China, 17–20 October 2005.
2. Heitz, G.; Koller, D. Learning spatial context: Using stuff to find things. In Proceedings of the European Conference on Computer Vision, Marseille, France, 12–18 October 2008.
3. Prest, A.; Schmid, C.; Ferrari, V. Weakly supervised learning of interactions between humans and objects. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 601–614.
4. Rabinovich, A.; Vedaldi, A.; Galleguillos, C.; Wiewiora, E.; Belongie, S. Objects in context. In Proceedings of the IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–20 October 2007.
5. Wolf, L.; Bileschi, S. A critical view of context. Int. J. Comput. Vis. 2006, 69, 251–261.
6. Harzallah, H.; Jurie, F.; Schmid, C. Combining efficient object localization and image classification. In Proceedings of the International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009.
7. Murphy, K.; Torralba, A.; Freeman, W. Using the forest to see the trees: A graphical model relating features, objects and scenes. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–13 December 2003.
8. Torralba, A. Contextual priming for object detection. Int. J. Comput. Vis. 2003, 53, 169–191.
9. Strat, T.; Fischler, M. Context-based vision: Recognizing objects using information from both 2-D and 3-D imagery. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 1050–1065.
10. Moore, D.J.; Essa, I.A.; Hayes, M.H. Exploiting human actions and object context for recognition tasks. In Proceedings of the IEEE International Conference on Computer Vision, Corfu, Greece, 20–27 September 1999.
11. Gupta, A.; Kembhavi, A.; Davis, L.S. Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 1775–1789.
12. Yao, B.; Li, F. Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1691–1703.
13. Grabner, H.; Gall, J.; Gool, L.V. What makes a chair a chair? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 21–25 June 2011.
14. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, CA, USA, 3–8 December 2012.
15. Zhu, J.; Zou, H.; Rosset, S.; Hastie, T. Multi-class AdaBoost. Stat. Interface 2009, 2, 349–360.
16. Raptis, M.; Sigal, L. Poselet key-framing: A model for human activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013.
17. Maji, S.; Bourdev, L.D.; Malik, J. Action recognition from a distributed representation of pose and appearance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011.
18. Bourdev, L.; Yang, F.; Fergus, R. Deep Poselets for Human Detection. Available online: http://arxiv.org/abs/1407.0717 (accessed on 5 July 2015).
19. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005.
20. Koenderink, J.; Doorn, A.V. Representation of local geometry in the visual system. Biol. Cybern. 1987, 55, 367–375.
21. Laptev, I.; Caputo, B.; Schuldt, C.; Lindeberg, T. Local velocity-adapted motion events for spatio-temporal recognition. Comput. Vis. Image Underst. 2007, 108, 207–229.
22. Chakraborty, B.; Holte, M.B.; Moeslund, T.B.; Gonzàlez, J. Selective spatio-temporal interest points. Comput. Vis. Image Underst. 2012, 116, 396–410.
23. ConvNetJS CIFAR-10 Demo. Available online: http://cs.stanford.edu/people/karpathy/convnetjs/cifar10.html (accessed on 25 September 2015).
24. Action Videos. Available online: https://vision.skku.ac.kr (accessed on 5 February 2014).
25. Google Images. Available online: https://images.google.com (accessed on 5 February 2014).
26. ImageNet. Available online: https://www.image-net.org (accessed on 5 December 2014).

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
