IEEE TRANSACTIONS ON CYBERNETICS, VOL. 44, NO. 9, SEPTEMBER 2014


Introducing Memory and Association Mechanism Into a Biologically Inspired Visual Model

Hong Qiao, Senior Member, IEEE, Yinlin Li, Tang Tang, and Peng Wang

Abstract—A famous biologically inspired hierarchical model (the HMAX model), which corresponds to areas V1 to V4 of the ventral pathway in the primate visual cortex, has been successfully applied to multiple visual recognition tasks. The model achieves position- and scale-tolerant recognition, which is a central problem in pattern recognition. In this paper, based on further biological experimental evidence, we introduce memory and association mechanisms into the HMAX model. The main contributions of this work are: 1) mimicking the active memory and association mechanism and adding top-down adjustment to the HMAX model, which is the first attempt to add active adjustment to this famous model; and 2) from the perspective of information, algorithms based on the new model reduce the storage requirement while achieving good recognition performance. The new model is applied to object recognition, and preliminary experimental results show that our method is efficient with a much lower memory requirement.

Index Terms—Association, biologically inspired visual model, memory, object recognition.

I. Introduction

A series of neural computational models have been proposed for vision processing based on biological mechanisms and have been successfully applied to pattern recognition [1]–[4]. The integration of biological research with information technology has recently become an important research trend. In particular, in 1999, Riesenhuber and Poggio proposed a famous neural computational model for visual processing [2]. This model is a hierarchical feed-forward model of the ventral stream of the primate visual cortex, briefly called HMAX. The model is closely tied to biological results: each level was designed according to anatomical and physiological data, mimicking the architecture and response characteristics of the visual cortex. Giese and Poggio [5] further extended this model to biological motion recognition. They established a hierarchical


Manuscript received July 4, 2013; revised September 27, 2013; accepted October 10, 2013. Date of publication October 30, 2013; date of current version August 14, 2014. This work was supported in part by the National Natural Science Foundation of China under Grant 61033011, Grant 61210009, and Grant 61379093, and in part by the National Key Technology Research and Development Program 2012BAI34B02. This paper was recommended by Associate Editor X. Li. The authors are with the Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]; [email protected]; [email protected]; peng [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2013.2287014

neural model with two parallel processing streams, the form pathway and the motion pathway. A series of extended models have given very good performance in biological motion recognition in clutter [6] and have been compared with psychophysical results [7]. The model was also extended to a computational model for general object recognition tasks [8], which can output position- and scale-invariant features by alternating template matching and maximum pooling operations in its S1-C2 layers. Some researchers introduced sparsification, feedback, and lateral inhibition [9], [10] into HMAX. These works have demonstrated very good performance in a range of recognition tasks, such as face recognition [11], scene classification [12], and handwritten digit recognition [13], competitive with state-of-the-art approaches. Moreover, inspired by the HMAX model, some researchers modified its max-pooling structure and took the pooling layer as saliency templates for object tracking [14], [15]. On the other hand, inspired by the behavioral and neural architecture of the early primate visual system, Itti [3], [16] established a visual attention model. Itti, Poggio, and colleagues [17] merged the saliency-based attention model proposed in the earlier work [3] into HMAX to modulate the activity of the S2 layer. Here, the S2 layer mimics the response characteristics of V4 in the primate visual cortex, which shows attentional modulation in electrophysiological and psychophysical experiments. Saliency-based visual attention models have been further compared with behavioral experiments [18], which show that the models can account for a significant portion of human gaze behavior in a naturalistic, interactive setting. A series of visual attention models [19] have demonstrated successful applications in computer vision [20], mobile robotics [21], and cognitive systems [22]. There are also works trying to combine HMAX and deep belief networks (DBN).
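To make the alternating S/C operations concrete, the following is a minimal illustrative sketch of the template-matching (S) and max-pooling (C) idea. It is our own simplified code, not the original HMAX implementation: the Gabor filtering of the real S1 layer is omitted, and the function names and the Gaussian-style similarity are assumptions for illustration only.

```python
import numpy as np

def s_layer(image, templates):
    """S layer (template matching): response of each stored template at
    every image position; higher response means a closer match."""
    H, W = image.shape
    t = templates.shape[1]  # templates: (n_templates, t, t)
    out = np.zeros((len(templates), H - t + 1, W - t + 1))
    for k, tpl in enumerate(templates):
        for i in range(H - t + 1):
            for j in range(W - t + 1):
                patch = image[i:i + t, j:j + t]
                # Gaussian-like similarity in [0, 1]
                out[k, i, j] = np.exp(-np.sum((patch - tpl) ** 2))
    return out

def c_layer(s_maps, pool=2):
    """C layer (max pooling): taking the local maximum over position
    gives tolerance to small shifts of the stimulus."""
    n, H, W = s_maps.shape
    out = np.zeros((n, H // pool, W // pool))
    for i in range(H // pool):
        for j in range(W // pool):
            out[:, i, j] = s_maps[:, i * pool:(i + 1) * pool,
                                  j * pool:(j + 1) * pool].max(axis=(1, 2))
    return out
```

Stacking such S/C pairs (S1-C1-S2-C2) yields the position- and scale-tolerant C2 features referenced throughout this paper.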
The HMAX model can extract position- and scale-invariant features with a feed-forward hierarchical architecture, but the lack of feedback limits its performance in pure classification tasks. DBN [23] has shown state-of-the-art performance in a range of recognition tasks [24], [25]. DBN uses generative learning algorithms, utilizes feedback at all levels, and provides the ability to reconstruct, complete, or sample sensory data, but it lacks the position and scale invariance of HMAX. Therefore, combining DBN with HMAX is a meaningful way to extract more accurate features for pattern recognition. The above models are all biologically inspired and have been successfully applied in practical work. Similar

2168-2267 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

works also include [26]–[30]. In this paper, based on other biological results, we try to introduce memory and association mechanisms into the HMAX model to enhance its learning ability. The main contributions of this paper are explained in detail as follows.
1) In the Object Memorizing Process: Our proposed model mimics certain characteristics of the human memory mechanism.
a) In our model, one object is memorized by semantic attributes and special image patches (corresponding to episodic memory). The semantic attributes describe each part of the object with a clear physical meaning, for example, whether the eyes or mouth of a face are big or small. A special patch is selected if the value of the corresponding semantic feature is far from the average; the patches should therefore be the prominent parts of the object.
b) In our model, the different features (semantic attributes and special patches) of an object are stored in distributed places, while a common feature of different objects is saved aggregately, which makes it possible to discriminate among the similar features of different objects. The similarity thresholds for each object can be learned as new objects are learned.
2) In the Object Recognition Process: Mirroring the biological process in which associated recognition includes familiarity discrimination and recollective matching, our proposed model first compares the special patches of the candidate with those of the saved objects that share the same prominent semantic features, using the HMAX model, in mimicking familiarity discrimination (knowing through the episode). Then, our model combines the comparison results of the special patches with those of the semantic features in mimicking recollective matching.
The remaining part of this paper is organized as follows. In Section II, we briefly summarize the biological basis and the supporting research of this paper.
In Section III, the detailed description of the model is given. In Section IV, the recognition performance of the model on face and house datasets is evaluated and compared with the original model. Finally, some conclusions are given in Section V.

II. Biological Evidence for the New Model

In this paper, we propose a model based on HMAX and some basic facts about the memory and association mechanisms established over the last decade through several physiological studies of the cortex [31]–[34]. The accumulated evidence is summarized below.
1) Memory About an Object Includes Episodic and Semantic Memory: Tulving et al. [34] pointed out that episodic memory and semantic memory are two subsystems of

Fig. 1. Sketch of the relationship between semantic and episodic memory [31].

declarative memory. Though the two systems share many features, semantic memory is considered the basis of episodic memory, which has additional capabilities beyond it. In the memory encoding process, information is encoded into semantic memory first and is then encoded into episodic memory through semantic memory, as shown in Fig. 1.
2) An Object Concept May Be Represented by Discrete Cortical Regions: Brain imaging studies show that object representation is related to different cortical regions, forming a distributed network that parallels the organization of the sensory and motor systems [32]. For example, in word generation experiments, ventral and lateral regions of the posterior temporal cortex have been found to respond differentially depending on the type of information retrieved [35]. In other related behavioral experiments in which the subjects were required to name an action or a color, the specialization of different cortical regions has also been observed [35].
3) Neurons Responding to the Common Feature of Different Objects Aggregate Together: Researchers believe that the brain tends to process common features from different objects in the same areas [32]. One supporting fact for this assertion is the existence of a cortical region responding to the shape attributes of different visual categories. In related behavioral experiments, when the subject was required to perform the same task with different object categories as stimuli, the ventral occipitotemporal cortex was consistently activated and encoded the object forms with distinct neural response patterns [32]. Another body of evidence comes from studies of cortical lesions [36], [37]: damage to the temporal lobes is strongly associated with impaired abilities to recognize objects and retrieve information about object-specific characteristics [36], [37]. This implies that the temporal lobe is the region in which object-specific information from all categories is commonly stored.

QIAO et al.: INTRODUCING MEMORY AND ASSOCIATION MECHANISM INTO A BIOLOGICALLY INSPIRED VISUAL MODEL

Fig. 2. Proposed model: introducing biologically inspired memory and association into the HMAX model.

Functional columns may provide the anatomical basis for the above phenomena, and recent studies suggest that the functional column is a basic processing architecture spread over the visual neural system [38]–[40]. On the surface of the inferotemporal cortex activated by faces, neurons with common preferences were found to aggregate in patches specific to different stimuli [38]. A similar organization was also seen in the area of posterior TE that is responsive to simple 2-D shapes [39]. It is inferred that such a columnar organization would produce stimulus invariance [41] and also provide a

visual alphabet from which numerous complex objects can be efficiently represented as a distributed code [40].
4) Recognition Memory Includes Familiarity and Recollection Components: As a basic functional part of memory, recognition memory is used to identify and judge whether a presented object has ever been consciously encountered [33], [42]. It is widely accepted that two components make up recognition memory: one is familiarity discrimination, which determines knowing, though the episode in which the object appeared may not be

Fig. 3. Semantic attributes and special episodic patches in memory.

recalled, and the other is recollective matching, which means remembering both the presence of the object and the related episode. Different mechanisms have been proposed to explain the two recognition experiences. Some researchers argue that the only difference between familiarity discrimination and recollective matching is the trace strength, in terms of which the former is weaker than the latter [43]–[45], while others regard the two processes as qualitatively different modes of recognition memory, which may be executed by the hippocampus and the perirhinal cortex, respectively [33], [42].
5) Familiarity Recognition Is Rapid and Accurate and Needs Only a Small Number of Neurons: From an evolutionary perspective, a simplified recognition system focusing on familiarity is likely to have an advantage in speed of response to novelty by saving the time of deep recall. This hypothesis has been confirmed in related experiments, in which subjects made familiarity decisions faster than recollection decisions [34], [46], [47]. The computational efficiency of a recognition system dedicated to familiarity discrimination has also been demonstrated with simulated neural networks [48]. Compared with systems relying on associative learning, familiarity decisions can be made accurately and faster by a specially designed network with a smaller neuron population.
All of the above works form the basis of our model, which is explained in Section III.

III. New Model: Introducing Memory and Association Into a Biologically Inspired Model

The new model is presented in Fig. 2. This section includes five parts, which introduce the framework of our model and the algorithms of its subprocesses.

A. Framework of Our Model

Fig. 2 presents our model, which introduces biologically inspired memory and association mechanisms into HMAX. Our framework consists of five parts: block 1 to block 5.
1) Objects Are Memorized Through Episodic and Semantic Features (Block 1): Oi represents the ith object (i = 1, ..., n), aji represents the jth semantic attribute (j = 1, ..., m) of the ith object,


and sxi represents the xth special episodic patch of the ith object. In the face recognition process, a semantic description can be "the eyes are large," "the mouth is small," and so on. The special episodic patch can be the image of an eye if the eyes are prominent on the face. These two kinds of features correspond to the semantic and episodic memory of declarative memory in cognitive science [31], respectively.
2) Features of One Object Are Saved in Distributed Memory and Common Features of Various Objects Are Stored Aggregately (Block 2): A1 to Am represent the common semantic attributes of different objects. Sx is a library of special episodic patches of different objects. A common attribute of different objects is stored in the same region; for example, a11 to a1n, the first attribute of the different objects (i = 1, ..., n), are stored together in a region called A1. A1 not only corresponds to a clear biological area but also has a learning ability; for example, its similarity sensitivity can be learned as new objects are saved. This memory mechanism has clear biological evidence [32], [41].
3) Recognition of One Candidate (Block 3): Tt represents a candidate. The semantic features a1t to amt and the episodic patch feature sxt are extracted before recognition.
4) Familiarity Discrimination (Block 4): Familiarity discrimination is achieved through the HMAX model. Both sxi and sxt are extracted in the C1 layer of the HMAX model. The saved objects can be ordered by comparing their similarity to the candidate in the C2 layer of the HMAX model. This process corresponds to familiarity discrimination (knowing through the episode) in biological recognition memory [42].
5) Recollective Matching (Block 5): Recollective matching is achieved by integrating the semantic and episodic feature similarity analyses. We compare the semantic features of the candidate with those of the top-ranked saved objects from familiarity discrimination. If the difference between the candidate and the closest object does not exceed the threshold learned during the memory process, we consider the candidate to be that object. This procedure is inspired by recollective matching in recognition memory [42].
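The five blocks above can be summarized in a compact sketch of the two-stage recognition flow. This is our own illustrative code under simplifying assumptions: `memory` is a hypothetical store mapping each identity to its episodic C2 feature vector and semantic attribute vector, and Euclidean distance stands in for the similarity measures described later.

```python
import numpy as np

def recognize(candidate_patch_feat, candidate_sem, memory, thres, top_k=3):
    """Two-stage recognition: familiarity discrimination (Block 4)
    followed by recollective matching (Block 5).

    memory: {identity: (patch_feat, sem_feat)} -- hypothetical store of
    episodic C2 features and semantic attribute vectors per known object.
    thres:  {identity: learned similarity threshold}
    Returns the matched identity, or None for a 'new' object.
    """
    # Block 4: rank known objects by episodic patch (C2 feature) similarity
    dists = {i: np.linalg.norm(candidate_patch_feat - pf)
             for i, (pf, _) in memory.items()}
    ranked = sorted(dists, key=dists.get)[:top_k]
    # Block 5: compare semantic features among the top-ranked candidates
    best = min(ranked, key=lambda i: np.linalg.norm(candidate_sem - memory[i][1]))
    sem_dist = np.linalg.norm(candidate_sem - memory[best][1])
    return best if sem_dist <= thres[best] else None  # None -> 'new' object
```

The thresholded return mirrors the decision in Block 5: a candidate whose closest match still exceeds the learned threshold is labeled new rather than forced onto a known identity.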

B. Encoding Episodic and Semantic Features (Block 1)

As we know, memory is the process by which information is encoded, stored, and retrieved, and memory of an object involving conscious processing is called declarative memory, which can be divided into semantic memory and episodic memory [31], [49]. Semantic memory refers to learned knowledge, such as meanings, understandings, concept-based knowledge, work-related skills, and facts [50]. Episodic memory refers to memory explicitly located in the past and involving specific events and situations, such as times, places, associated emotions, and other contextual knowledge [51], [52].

TABLE I Geometric Features of a Face

Fig. 4. Distributed memory structure.

In this paper, the features of a saved object i include descriptive attributes a1i to ami and a special patch sxi (Fig. 3), where the descriptive attributes correspond to semantic features and the special patches correspond to episodic features.
1) Semantic Features: In the biological process, humans memorize an object using semantic features. In the computational process, semantic features can reduce the memory size requirement. However, obtaining semantic features requires a large dataset and learning ability, which correspond to prior knowledge. A simplified algorithm is presented as follows. First, we extract semantic features aji from patches with a clear physical meaning (such as the images of the eyes or mouth of a face) for each object i. Then, we compute an average value of each feature over all objects. Finally, we compare the two (j = 1, ..., m) to determine the semantic characteristics of a particular object.
Take face recognition as an example. In this paper, the active shape model (ASM) is used to extract typical points from faces. To avoid vague descriptions of one part as big or small, we compute effective geometric features of the organs (such as the eyes, nose, and mouth), which are given in Table I. In biological studies, these features are considered to be among the most discriminative ones humans use to recognize a person, and they are therefore a reasonable choice for our biologically plausible model [53], [54]. More importantly, these facial features of high visual saliency are essential for our goal of modeling the semantic recognition process, as they are basic elements of semantic descriptions of a human face. We can establish an average value for each geometric feature. By comparing each individual with the average face, we obtain computational semantic attributes for each part. In the experiments, we will show that these geometric features carry general semantic meaning, such as "the eyes are large," and so on. Therefore, these features are used as semantic features in the face experiment.
It should be noted that representing the geometric features of a face component should consider the influences of expression, view direction, lighting, and even the subjective feeling of an observer. In order to get fair results, our current dataset is collected from frontal faces with little expression variation, and the semantic features are computed as

a_{ji} = \frac{\bar g_{ji}}{\sigma_{ji}}, \quad \bar g_{ji} = \frac{1}{Q_i}\sum_{q=1}^{Q_i} g_{jiq}, \quad \sigma_{ji} = \sqrt{\frac{1}{Q_i}\sum_{q=1}^{Q_i} \left(g_{jiq} - \bar g_{ji}\right)^2} \qquad (1)

where g_{jiq} is the measured jth geometric feature from the qth image of the ith object, \bar g_{ji} is the average value of g_{jiq} over q,


and \sigma_{ji} is the standard deviation of the jth geometric feature of the ith object. For one person, many other attributes, such as gender, race, and age, can be included in further research. Some events and their relationships can enrich the memory processes.
2) Episodic Patches: In order to mimic episodic memory, we need to find the dominant patches of an object by finding the prominent semantic features of object i via (2). Take face recognition as an example again. First, we find the prominent semantic features of physical parts such as the eyes, nose, and mouth

\{\hat j_{ik}\} = \arg\max_{\{j_{ik}\} \subset \{1,2,\dots,m\}} \|a_{ji} - \bar a_j\|_2, \quad k = 1, 2, \dots, N_{top} \qquad (2)

where \{\hat j_{ik}\} are the N_{top} dominant semantic features of the ith object, and \bar a_j is the jth semantic feature averaged over all objects. The special episodic patches corresponding to \{\hat j_{ik}\} of object O_i are extracted and put into the HMAX model proposed in [2], which extracts a set of position- and scale-tolerant features. Different from the original HMAX model, only the episodic patches corresponding to the prominent parts of a candidate are put into the model, and only those of the known objects are stored in a library.

C. Distributed Memory Structures (Block 2)

How to remember and organize different features is a key problem in the memory process, and it also influences the retrieval and association processes. This is an important part of our framework. Studies in the cognitive sciences have indicated that the extraction, storage, and retrieval of different features are realized by distributed cortical areas. As listed in Section II, much evidence shows that an object concept may be represented by discrete cortical regions [35] and that neurons responding to the common feature of different objects aggregate together [41]. In this paper, we mimic this distributed structure to memorize and retrieve features. Suppose there are m visual cortical regions A1 to Am (Fig. 4), which are sensitive to different kinds of features a1t to amt. Thus, a candidate with feature ajt will activate a distinctive response in region Aj. This kind of distributed structure has great advantages: the visual cortical region Aj can become more sensitive and effective by comparing different aji. Two similarity thresholds for semantic attributes and episodic patches, which

Algorithm 1: Memory Process
Input: Objects to be learned, {Oi}
Output: Semantic features {aji}, special episodic patches {sxi}, thresholds {thresi} for judging an object as 'known'
1: for each Oi do
2:   Extract geometric attributes gjiq from all images related to the ith object
3:   Compute semantic features aji from gjiq by (1)
4:   Select the prominent semantic features by (2)
5:   Extract the special episodic patches sxi corresponding to the prominent features
6:   Compute thresi by (4)
7: end for

Algorithm 2: Recognition Process
Input: Object Tt to be recognized; semantic features {aji}, special episodic patches {sxi}, and thresholds {thresi} learned before
Output: Identity of the incoming object
1: Extract episodic features {sxt} of Tt using the HMAX model
2: for each Oi learned before do
3:   Compute similarities between sxt and sxi, which come from the same semantic part of the incoming object and the learned objects, respectively
4: end for
5: Sort the learned objects by (3)
6: Select the top-ranking learned objects in terms of similarity for further recognition
7: Extract semantic features {ajt}
8: Select the most similar learned object by (5)
9: if the object is 'known' by (4) then
10:   Output the recognized identity
11: else
12:   Label the object as 'new'
13: end if

Fig. 5. (a) Face images used in the experiments. (b) The first person's eight pictures used in the memory process. (c) The first person's eight testing pictures.
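Steps 3 and 4 of the memory process (equations (1) and (2)) can be sketched as follows. This is our own minimal illustration, not the paper's code; the function names and the epsilon guard against zero variance are assumptions added for the sketch.

```python
import numpy as np

EPS = 1e-12  # guards the division in (1) when a feature has zero variance

def semantic_features(g):
    """Implements (1). g: array of shape (m, Q_i) holding the m geometric
    measurements over the Q_i images of one object.
    Returns a_ji = mean / std (population std, as in (1))."""
    g = np.asarray(g, dtype=float)
    return g.mean(axis=1) / (g.std(axis=1) + EPS)

def prominent_features(a_i, a_bar, n_top=2):
    """Implements (2): indices of the n_top attributes of object i that
    deviate most from the across-object average a_bar."""
    deviation = np.abs(np.asarray(a_i) - np.asarray(a_bar))
    return np.argsort(deviation)[::-1][:n_top]
```

In the face experiments, `g` would hold the Table I geometric measurements, and the returned indices would name the prominent organs whose patches are stored as episodic features.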

would decide whether a candidate is a known object, are learned when a new object is memorized. In further work, the attributes of one object could also be connected and influence each other.

D. Recognition Based on Familiarity Discrimination and Recollective Matching (Blocks 4 and 5)

As shown in Section II, there are two processes in recognition memory: familiarity discrimination and recollective matching [42]; their combination enables fast and accurate human recognition. In the proposed framework, we also have two corresponding processes.
1) Familiarity Discrimination: The purpose of familiarity discrimination is to sense whether the given subject is known. After the special patch similarity analysis, it is necessary to rapidly decide whether the subject is known. The problem is then

Fig. 6. (a) Labeled control points of some face images. (b) One special patch for one organ (in scale 4). (c) Two special patches for one organ (in scale 4).

how to design the threshold for each saved subject, which can be used to decide whether the candidate is that subject. In block 1, we establish a library of episodic patches {s_{xi}} of different known objects in the memory process. In the familiarity discrimination step of the association process, a special episodic patch s_{xt} of the candidate T_t is extracted and put into the C1 layer of the HMAX model. Then, we compute the S2- and C2-level features of the candidate and of the known objects that have the same prominent semantic attribute. Finally, we sort the known objects by the similarity of their C2 features to those of the candidate. However, instead of presenting the result of the familiarity process as yes (the dissimilarity is smaller than thres) or no (the dissimilarity is larger than thres), in this paper the result is presented as the probability of the candidate being a known subject, computed by

p(O_i \mid T_t) = \frac{p(T_t \mid O_i)\, p(O_i)}{\sum_{i=1}^{N} p(T_t \mid O_i)\, p(O_i)}. \qquad (3)

Since we assume all p(O_i) are equal, the identities are actually sorted by p(T_t | O_i) = exp(-d), where d is the Euclidean distance between the features of the learned face and those of the incoming face. Then, only the top-ranking candidates in terms of similarity are retained for the next step of recognition.
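With uniform priors, (3) reduces to a softmax over negated distances. A minimal sketch (our own illustrative helper, with an assumed name):

```python
import numpy as np

def familiarity_probabilities(d):
    """Implements (3) with uniform priors p(O_i).

    d: array of Euclidean distances between the candidate's C2 features
    and each known object's. Since p(T_t|O_i) = exp(-d_i), normalizing
    the likelihoods gives p(O_i|T_t)."""
    likelihood = np.exp(-np.asarray(d, dtype=float))
    return likelihood / likelihood.sum()
```

Ranking the identities by these probabilities (e.g., `np.argsort(-p)`) reproduces the sort by exp(-d) described above, since normalization preserves the ordering.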


TABLE II Memorized Features for One Known Person, With scalesize = 4 ([4, 6, 8, 10] in the C1 Layer); p Represents the Number of Prominent Organs (Eye, Nose, Mouth), t Represents the Image Number of Each Known Person, Which Is Eight for the CASIA-RTA Database, pn Represents the Patch Number for One Prominent Organ (Different Scales Not Counted), and pn = 1, 2 Are Both Tested As Shown in Fig. 6(b) and (c)

TABLE III Similarity Analysis and Rank

TABLE IV Thresholds for Five Persons’ Images

TABLE V Dissimilarities Between Three Other Persons’ Images and Five Known Persons’ Images

Fig. 7. (a) First and second persons with big eyes, and the third one with small eyes. (b) Candidates in the special patch evaluation.

This process corresponds to familiarity discrimination in recognition memory [42].
2) Recollective Matching: We compare the semantic features a_{jt} of the object with those of the top known objects obtained from familiarity discrimination. If the smallest dissimilarity does not exceed the threshold learned during the memory process, the candidate is recognized as the closest known object. Otherwise, we regard the candidate as new. This procedure is inspired by recollective matching in recognition memory [42]. In this paper, the threshold for a known identity i is given as follows:

Finally, if the object is considered known, its identity is found among the remaining candidates passed from the familiarity discrimination process by

\hat i = \arg\min_{i \in \{i_{top}\}} \sum_{j=1}^{m} \|a_{jt} - a_{ji}\|_2. \qquad (5)

The memory and recognition algorithms are given as Algorithms 1 and 2, respectively.

IV. Experiments

thres1_i = \max_q \|f_{iq} - \bar f_i\|_2, \quad thres2_i = \min_{i^*,\, i^* \neq i} \|\bar f_{i^*} - \bar f_i\|_2, \quad thres_i = \frac{thres1_i + \lambda\, thres2_i}{1 + \lambda} \qquad (4)

In this section, we evaluate our model on two databases: 1) CASIA-RTA (established by ourselves) and 2) CAS-PEAL-R1 [55].


where f_{iq} is the feature of the qth image of the ith object (the ith object has Q_i images), \bar f_{i^*} is the average feature of the images belonging to the i*th object, thres1_i is the intraclass maximum difference, thres2_i is the interclass minimum difference, and thres_i is the final similarity threshold. Clearly, the larger the threshold, the less sensitive the matching. The coefficient λ weights thres2_i against thres1_i and is used to adjust thres_i, making it more flexible.
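The threshold rule (4) can be sketched directly. This is our own minimal illustration under assumed names: `feats_i` holds one object's per-image feature vectors, and `mean_feats_others` the other objects' average feature vectors.

```python
import numpy as np

def learn_threshold(feats_i, mean_feats_others, lam=1.0):
    """Implements (4).

    feats_i: array (Q_i, D) of the ith object's per-image features.
    mean_feats_others: iterable of average feature vectors of the
    other objects (the f-bar_{i*} terms).
    lam: weighting coefficient between intra- and inter-class terms."""
    f_bar = feats_i.mean(axis=0)
    # intra-class maximum difference
    thres1 = max(np.linalg.norm(f - f_bar) for f in feats_i)
    # inter-class minimum difference
    thres2 = min(np.linalg.norm(f - f_bar) for f in mean_feats_others)
    return (thres1 + lam * thres2) / (1 + lam)
```

A larger `lam` pushes the threshold toward the inter-class gap (more permissive matching), while `lam = 0` keeps only the intra-class spread.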

A. Face Recognition on CASIA-RTA

We first apply the algorithms based on the new model to face recognition on CASIA-RTA. To make the application of the algorithms clear, we now describe the process in detail. This database contains frontal faces of seven persons with slight expression variations. The first five persons' face images are used in both the memory and recognition phases, and each phase has eight different images per person. The other two persons' face images are used only in the recognition phase, for unfamiliarity detection. Some samples of the database are shown in Fig. 5.

TABLE VI Probability and Rankings of First Four Persons’ Other Images Compared With Their Known Images Through Special Patches

Fig. 8. (a) Some persons’ images used in the experiments. (b) First person’s five pictures used in the memory process. (c) First person’s five testing pictures.

TABLE VII Thresholds for Attributes

TABLE VIII Face Recognition Precision of Our New Model on CASIA-RTA, Eight Training Images for Each Person

The experimental details are given as follows.
1) Features in the Memory Phase: In the memory phase, for each image in CASIA-RTA, 68 control points of the face are extracted [see Fig. 6(a)], and then 17 semantic attributes (as shown in Table I) are calculated based on these points. Special patches are extracted according to one or two prominent face organs, and some examples are given in Fig. 6(b) and (c). The whole memory size for a known person is given in Table II.
2) Special Patch Evaluation: Although lighting, expression, scale, and other factors can influence the similarity analysis, our experimental results show that the similarity of the special patches through HMAX primarily matches the semantic description; for example, a candidate's special patch is more similar to patches with the same semantic description. In the memory process, the semantic descriptions of the eyes of the first three persons [Fig. 7(a)] are first computed. The results indicate that the first two persons have big eyes and the third person has small eyes. Then, we extracted eye patches with scales four and six in the C1 level. In the recognition process, three other pictures of the first three

persons are used [see Fig. 7(b)]. The similarity analysis and ranking are presented in Table III, in which patch_{i.j} represents an eye patch of the ith person with scale j × j. The patches are ranked according to their average values from the C2 layer. Eye patches of the first two persons generate higher values with portraits having bigger eyes, while the patches of the third person prefer the smaller ones. This rule holds even in cases where the eyes of the first two identities are confused. The results above demonstrate that the low-level features used in the model are reliable indicators of visual properties. By combining a sufficient number of such features, visual objects can be efficiently represented by response patterns, just like the neural activities of the cortical regions that encode them.
3) Familiarity Discrimination: The thresholds for the five known persons' C2-level features are shown in Table IV, and the dissimilarities between the three other persons' images and the five known persons' images are given in Table V. These dissimilarities are all larger than the thresholds; therefore, the three other persons can be rapidly recognized as unknown. Based on (3), the first five persons' images used for recognition are compared with their images used for memory. The probabilities and the corresponding rankings are given in Table VI (only the results for three images of four persons are listed). Each person's other images are close to their own through familiarity matching. The three known objects with the top matching probabilities are retained for further recognition.
4) Recognition by Recollective Matching: In this process, the similarity in semantic meaning is compared. The attribute thresholds computed by (4) for the five persons are given in Table VII.
By computing the differences between the test image and the reserved persons' attributes successively, and comparing them with the attribute thresholds, the final label for each test image is obtained. The recognition precision is shown in Table VIII. As shown there, the precision does not improve with more patches of the same organ (see Fig. 6), which suggests that a small number of selected special patches can achieve good discriminative ability. This will be verified below in Part B with more comparison experiments.

Fig. 9. (a) Automatically labeled control points of some face images. (b) p = 2, two special patches corresponding to two organs (scales: 4, 6). (c) The same number of random patches extracted by the HMAX model for comparison experiments.

B. Face Recognition on CAS-PEAL-R1

We now evaluate the new model on the public face database CAS-PEAL-R1 [55], a large-scale Chinese face database collected under the sponsorship of the Chinese National Hi-Tech Program and ISVISION Tech. Co. Ltd. Thirteen persons were selected for the experiments, each with five images in the memory phase and five in the recognition phase. These images have slight expression, accessory, and lighting variations. Some samples of this database are shown in Fig. 8.

1) Feature and Parameter Selection: For CAS-PEAL-R1, Stasm [56] was used to extract 76 control points of the face automatically, with p = 1-3, pn = 1, t = 5, and scalesize = [4, 6, 8, 10] (see Table II for their definitions); the other parameter settings and processes are the same as in the experiment on the CASIA-RTA database. Some persons' samples with control points and patches are shown in Fig. 9.

2) Empirical Results: In this part, we compare our new model with a random patch selection method, which keeps the random patch selection mechanism of the HMAX model but is embedded into the new model's similarity measure framework; it is called RandPatch below. The comparison experiments are carried out on CAS-PEAL-R1 with a small number of training samples for each person, and the results are given in Table IX. Varying the number of special organs (eye, nose, mouth) from one to three, the best precision of our new model is 90.77%, obtained when two special organs are selected (four scales, eight patches in total for an image). It is worth mentioning that this high precision is achieved by combining special patches and semantic features, whose individual precisions are only 72.31% and 69.23%, respectively. We then test the performance of RandPatch, with the sample images cropped to contain only face regions in order to remove disturbances from the background.
The best result of RandPatch is 70.77%, obtained by extracting 12 patches at random positions on the face in each image. Although randomly extracted

TABLE IX Face Recognition Precision of RandPatch and Our New Model on CAS-PEAL-R1 With a Small Number of Training Samples (Five Training Images for Each Person)

patches cover a wider range of face regions, it seems that RandPatch cannot extract discriminative patches, and because of its random mechanism its results are not stable.

Memory size is a critical indicator of an algorithm. Two primary factors influence the memory size: the number of samples of the known persons and the number of patches for each face image. Their relations with recognition precision are shown in Table IX and Fig. 10. First, as shown in Table IX, keeping the number of image samples constant (five images for each person), we find that the precision of RandPatch does not improve clearly with more patches. The result of the new model improves remarkably when the number of patches increases from four to eight and then stays stable as the number of patches continues to increase. Second, as shown in Fig. 10, we select two special organs (corresponding to two patches, four scales) for every image. Increasing the number of sample images of each known person from one to five, the precision of special patches reaches its maximum and stays nearly stable once the sample number is three or more, which indicates that familiarity discrimination can achieve good performance with a small number of sample images. Meanwhile, the precision of semantic features, and of their combination with familiarity discrimination, increases from 32.31% to 67.69% and from 52.31% to 90.77%, respectively.
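The two-stage procedure evaluated in these comparisons, familiarity discrimination to reject unknown faces and reserve a few candidates, followed by recollective matching on semantic attributes, can be sketched as follows. The function and field names, the Euclidean dissimilarity, and the memory layout are illustrative assumptions; they stand in for the paper's functions (3) and (4) and the thresholds of Tables IV and VII.

```python
import numpy as np

def recognize(test_c2, test_attrs, memory, keep=3):
    """Two-stage recognition: familiarity discrimination, then
    recollective matching on semantic attributes.

    test_c2    : (d,) C2-level feature vector of the test image
    test_attrs : (k,) semantic attribute vector of the test image
    memory     : dict person -> dict with keys
                 "c2" (n_i, d) stored C2 features, "c2_thr" scalar,
                 "attrs" (k,) attribute vector, "attr_thr" (k,) thresholds
    keep       : number of candidates reserved after stage 1
    """
    # Stage 1: familiarity discrimination (fast rejection of unknowns);
    # dissimilarity = minimum Euclidean distance to any stored sample
    dissim = {p: np.linalg.norm(m["c2"] - test_c2, axis=1).min()
              for p, m in memory.items()}
    if all(dissim[p] > memory[p]["c2_thr"] for p in memory):
        return "unknown"
    candidates = sorted(dissim, key=dissim.get)[:keep]

    # Stage 2: recollective matching over the reserved candidates,
    # comparing attribute differences with the per-attribute thresholds
    for person in candidates:
        diff = np.abs(test_attrs - memory[person]["attrs"])
        if np.all(diff <= memory[person]["attr_thr"]):
            return person
    return "unknown"
```

Reserving only the top few familiar candidates is what keeps the memory and computation small: the more expensive attribute comparison runs on at most `keep` persons rather than the whole gallery.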


Fig. 10. Comparison results with a small number of training images: the number of sample images of each person varies from one to five (two patches for each image).

Moreover, the performance of RandPatch is also tested. We crop the sample images to contain only face regions as before; the maximum precision of RandPatch is 83.08%, still lower than that of our new model. To obtain a similar precision, RandPatch needs more patches, which is computationally expensive. Furthermore, when the sample number is three or more, the performance of RandPatch is stable or even degrades, which makes it difficult to improve RandPatch further.

From the comparison experiments with a small number of training images for each person, it is clear that by combining familiarity discrimination and recollective matching, the new model achieves better performance with a smaller memory requirement and is therefore more efficient than RandPatch.

V. Conclusion

In this paper, a biologically inspired memory and association model has been proposed. The typical features of the proposed model are: 1) in both the memory and recognition processes, objects have semantic and episodic features, and the episodic patches are extracted according to the semantic features; 2) the sensitivities to the semantic and episodic features are learned during the process; and 3) in the association process, the semantic and episodic features are compared separately and the results are integrated, corresponding to familiarity and recollective matching. Through the six blocks in Fig. 2, this mechanism is introduced into the HMAX model and the corresponding algorithms are given. The memory and association mechanism provides a top-down control function that actively selects the important features in memory and association and influences the recognition process.

The experimental results show that with the new method the required memory is small and the recognition performance is good. In future work, we will mimic the function and mechanism of the primate's higher visual cortex (e.g., IT and PFC) to achieve primary cognition of orientation, distortion, and occlusion, which will further improve the model's learning ability.

Acknowledgment

The authors would like to thank Dr. W. Wu and Dr. T. Chen for discussions and suggestions that improved this paper, and the associate editor and the reviewers for their constructive comments.

References

[1] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex," J. Physiol., vol. 160, no. 1, p. 106, 1962.
[2] M. Riesenhuber and T. Poggio, "Hierarchical models of object recognition in cortex," Nat. Neurosci., vol. 2, no. 11, pp. 1019–1025, 1999.
[3] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, Nov. 1998.
[4] N. C. Rust, V. Mante, E. P. Simoncelli, and J. A. Movshon, "How MT cells analyze the motion of visual patterns," Nat. Neurosci., vol. 9, no. 11, pp. 1421–1431, 2006.
[5] M. A. Giese and T. Poggio, "Neural mechanisms for the recognition of biological movements," Nat. Rev. Neurosci., vol. 4, no. 3, pp. 179–192, 2003.
[6] R. Sigala, T. Serre, T. Poggio, and M. Giese, "Learning features of intermediate complexity for the recognition of biological motion," in Proc. ICANN, 2005, pp. 241–246.
[7] A. Casile and M. A. Giese, "Critical features for the recognition of biological motion," J. Vis., vol. 5, no. 4, pp. 348–360, 2005.
[8] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, "Robust object recognition with cortex-like mechanisms," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 3, pp. 411–426, Mar. 2007.
[9] Y. Huang, K. Huang, D. Tao, T. Tan, and X. Li, "Enhanced biologically inspired model for object recognition," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 6, pp. 1668–1680, Dec. 2011.
[10] J. Mutch and D. G. Lowe, "Multiclass object recognition with sparse, localized features," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., vol. 1, Jun. 2006, pp. 11–18.
[11] E. Meyers and L. Wolf, "Using biologically inspired features for face processing," Int. J. Comput. Vision, vol. 76, no. 1, pp. 93–104, 2008.
[12] K. Huang, D. Tao, Y. Yuan, X. Li, and T. Tan, "Biologically inspired features for scene classification in video surveillance," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 1, pp. 307–313, Feb. 2011.
[13] M. Hamidi and A. Borji, "Invariance analysis of modified C2 features: Case study-handwritten digit recognition," Mach. Vision Appl., vol. 21, no. 6, pp. 969–979, 2010.
[14] S. Han and N. Vasconcelos, "Biologically plausible saliency mechanisms improve feedforward object recognition," Vision Res., vol. 50, no. 22, pp. 2295–2307, 2010.
[15] V. Mahadevan and N. Vasconcelos, "Biologically inspired object tracking using center-surround saliency mechanisms," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 3, pp. 541–554, Mar. 2013.
[16] L. Itti and C. Koch, "Computational modeling of visual attention," Nat. Rev. Neurosci., vol. 2, no. 3, pp. 194–203, 2001.
[17] D. Walther, L. Itti, M. Riesenhuber, T. Poggio, and C. Koch, "Attentional selection for object recognition: A gentle way," in Proc. Biol. Motivated Comput. Vision, 2002, pp. 472–479.
[18] R. J. Peters and L. Itti, "Applying computational tools to predict gaze direction in interactive visual environments," ACM Trans. Appl. Percept., vol. 5, no. 2, article no. 9, 2008.
[19] A. Borji and L. Itti, "State-of-the-art in visual attention modeling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 185–207, Jan. 2013.
[20] Y. F. Ma, X. S. Hua, L. Lu, and H. J. Zhang, "A generic framework of user attention model and its application in video summarization," IEEE Trans. Multimedia, vol. 7, no. 5, pp. 907–919, Oct. 2005.


[21] S. Lang, M. Kleinehagenbrock, S. Hohenner, J. Fritsch, G. A. Fink, and G. Sagerer, "Providing the basis for human–robot-interaction: A multi-modal attention system for a mobile robot," in Proc. 5th Int. Conf. Multimodal Interfaces, 2003, pp. 28–35.
[22] S. Frintrop, E. Rome, and H. I. Christensen, "Computational visual attention systems and their cognitive foundations: A survey," ACM Trans. Appl. Percept., vol. 7, no. 1, article no. 6, 2010.
[23] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[24] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 609–616.
[25] A.-R. Mohamed, G. E. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 14–22, Jan. 2012.
[26] H. Li, Y. Wei, L. Li, and C. L. P. Chen, "Hierarchical feature extraction with local neural response for image recognition," IEEE Trans. Cybern., vol. 43, no. 2, pp. 412–424, Apr. 2013.
[27] E. Chinellato, B. J. Grzyb, and A. P. del Pobil, "Pose estimation through cue integration: A neuroscience-inspired approach," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 530–538, Apr. 2012.
[28] M. Begum, F. Karray, G. K. I. Mann, and R. G. Gosine, "A probabilistic model of overt visual attention for cognitive robots," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 5, pp. 1305–1318, Oct. 2010.
[29] Y. Yu, G. K. I. Mann, and R. G. Gosine, "An object-based visual attention model for robotic applications," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 5, pp. 1398–1412, Oct. 2010.
[30] J. Tani, R. Nishimoto, J. Namikawa, and M. Ito, "Codevelopmental learning between human and humanoid robot using a dynamic neural-network model," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 1, pp. 43–59, Feb. 2008.
[31] E. Tulving and H. J. Markowitsch, "Episodic and declarative memory: Role of the hippocampus," Hippocampus, vol. 8, no. 3, pp. 198–204, 1998.
[32] A. Martin and L. L. Chao, "Semantic memory and the brain: Structure and processes," Curr. Opin. Neurobiol., vol. 11, no. 2, pp. 194–201, 2001.
[33] L. R. Squire, J. T. Wixted, and R. E. Clark, "Recognition memory and the medial temporal lobe: A new perspective," Nat. Rev. Neurosci., vol. 8, no. 11, pp. 872–883, 2007.
[34] B. McElree, P. O. Dolan, and L. L. Jacoby, "Isolating the contributions of familiarity and source information to item recognition: A time course analysis," J. Exp. Psychol. Learn. Mem. Cogn., vol. 25, no. 3, pp. 563–582, 1999.
[35] A. Martin, J. V. Haxby, F. M. Lalonde, C. L. Wiggs, and L. G. Ungerleider, "Discrete cortical regions associated with knowledge of color and knowledge of action," Science, vol. 270, no. 5233, pp. 102–105, 1995.
[36] J. Hart and B. Gordon, "Delineation of single-word semantic comprehension deficits in aphasia, with anatomical correlation," Ann. Neurol., vol. 27, no. 3, pp. 226–231, 1990.
[37] J. R. Hodges, K. Patterson, S. Oxbury, and E. Funnell, "Semantic dementia: Progressive fluent aphasia with temporal-lobe atrophy," Brain, vol. 115, no. 6, pp. 1783–1806, 1992.
[38] G. Wang, K. Tanaka, and M. Tanifuji, "Optical imaging of functional organization in the monkey inferotemporal cortex," Science, vol. 272, no. 5268, pp. 1665–1668, 1996.
[39] I. Fujita, K. Tanaka, M. Ito, and K. Cheng, "Columns for visual features of objects in monkey inferotemporal cortex," Nature, vol. 360, no. 6402, pp. 343–346, 1992.
[40] M. P. Stryker, "Neurobiology: Elements of visual perception," Nature, vol. 360, no. 6402, pp. 301–302, 1992.
[41] M. J. Tovee, "Is face processing special?" Neuron, vol. 21, no. 6, pp. 1239–1242, 1998.
[42] M. W. Brown and J. P. Aggleton, "Recognition memory: What are the roles of the perirhinal cortex and hippocampus?" Nat. Rev. Neurosci., vol. 2, no. 1, pp. 51–61, 2001.
[43] F. Haist, A. P. Shimamura, and L. R. Squire, "On the relationship between recall and recognition memory," J. Exp. Psychol. Learn. Mem. Cogn., vol. 18, no. 4, pp. 691–702, 1992.
[44] E. Hirshman and S. Master, "Modeling the conscious correlates of recognition memory: Reflections on the remember-know paradigm," Mem. Cogn., vol. 25, no. 3, pp. 345–351, 1997.
[45] W. Donaldson, "The role of decision processes in remembering and knowing," Mem. Cogn., vol. 24, no. 4, pp. 523–533, 1996.


[46] M. Seeck, C. M. Michel, N. Mainwaring, R. Cosgrove, H. Blume, J. Ives, et al., "Evidence for rapid face recognition from human scalp and intracranial electrodes," Neuroreport, vol. 8, no. 12, pp. 2749–2754, 1997.
[47] D. L. Hintzman, D. A. Caulton, and D. J. Levitin, "Retrieval dynamics in recognition and list discrimination: Further evidence of separate processes of familiarity and recall," Mem. Cogn., vol. 26, no. 3, pp. 449–462, 1998.
[48] R. Bogacz, M. W. Brown, and C. Giraud-Carrier, "High capacity neural networks for familiarity discrimination," in Proc. ICANN, vol. 2, 1999, pp. 773–778.
[49] L. R. Squire and S. M. Zola, "Episodic memory, semantic memory, and amnesia," Hippocampus, vol. 8, no. 3, pp. 205–211, 1998.
[50] D. Saumier and H. Chertkow, "Semantic memory," Curr. Neurol. Neurosci., vol. 2, no. 6, pp. 516–522, 2002.
[51] H. Eichenbaum, "A cortical-hippocampal system for declarative memory," Nat. Rev. Neurosci., vol. 1, no. 1, pp. 41–50, 2000.
[52] M. A. Conway, "Episodic memories," Neuropsychologia, vol. 47, no. 11, pp. 2305–2313, 2009.
[53] C. Renzi, S. Schiavi, C.-C. Carbon, T. Vecchi, J. Silvanto, and Z. Cattaneo, "Processing of featural and configural aspects of faces is lateralized in dorsolateral prefrontal cortex: A TMS study," NeuroImage, 2013.
[54] S. Yamane, S. Kaji, and K. Kawano, "What facial features activate face neurons in the inferotemporal cortex of the monkey?" Exp. Brain Res., vol. 73, no. 1, pp. 209–214, 1988.
[55] W. Gao, B. Cao, S. Shan, X. Chen, D. Zhou, X. Zhang, et al., "The CAS-PEAL large-scale Chinese face database and baseline evaluations," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 1, pp. 149–161, Jan. 2008.
[56] S. Milborrow and F. Nicolls, "Locating facial features with an extended active shape model," in Proc. ECCV, 2008, pp. 504–513.

Hong Qiao (SM'06) received the B.Eng. degree in hydraulics and control and the M.Eng. degree in robotics from Xi'an Jiaotong University, Xi'an, China, the M.Phil. degree in robotics control from the Industrial Control Center, University of Strathclyde, Glasgow, U.K., and the Ph.D. degree in robotics and artificial intelligence from De Montfort University, Leicester, U.K., in 1995. She was a University Research Fellow with De Montfort University from 1995 to 1997, a Research Assistant Professor from 1997 to 2000, and an Assistant Professor from 2000 to 2002 with the Department of Manufacturing Engineering and Engineering Management, City University of Hong Kong, Kowloon, Hong Kong. In 2002, she joined the School of Informatics, University of Manchester, Manchester, U.K., as a Lecturer. She is currently a Professor with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. She first proposed the concept of the attractive region in strategy investigation, which she successfully applied to robot assembly, robot grasping, and part recognition, as reported in Advanced Manufacturing Alert (Wiley, 1999). Her current research interests include information-based strategy investigation, robotics and intelligent agents, animation, machine learning, and pattern recognition. Dr. Qiao was a member of the Program Committee of the IEEE International Conference on Robotics and Automation from 2001 to 2004. She is currently an Associate Editor of the IEEE Transactions on Systems, Man, and Cybernetics, Part B, and the IEEE Transactions on Automation Science and Engineering.

Yinlin Li received the B.Sc. degree from Xidian University, Xi'an, China, in 2011. She is currently pursuing the Ph.D. degree in pattern recognition and intelligent systems at the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. Her current research interests include biologically inspired visual algorithms, object detection, and scene understanding.


Tang Tang received the B.Sc. degree from the Qingdao University of Science and Technology, Qingdao, China, in 2009. He is currently pursuing the Ph.D. degree in pattern recognition and intelligent systems at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His current research interests include computer vision, machine learning, and biologically inspired algorithms.

Peng Wang received the B.Eng. degree in electrical engineering and automation from Harbin Engineering University, Harbin, China, in 2004, the M.Eng. degree in automation science and engineering from the Harbin Institute of Technology, Harbin, China, in 2007, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2010. He is currently an Associate Professor with the Institute of Automation, Chinese Academy of Sciences. He has worked on computer vision, image processing, and intelligent robots. His current research interests include visual tracking, visual detection, visual attention models, and visual perception models.
