
Biologically inspired multilevel approach for multiple moving targets detection from airborne forward-looking infrared sequences

Yansheng Li, Yihua Tan,* Hang Li, Tao Li, and Jinwen Tian

National Key Laboratory of Science & Technology on Multi-spectral Information Processing, School of Automation, Huazhong University of Science and Technology, Wuhan 430074, China
*Corresponding author: [email protected]

Received October 15, 2013; revised January 23, 2014; accepted February 7, 2014; posted February 11, 2014 (Doc. ID 199445); published March 17, 2014

In this paper, a biologically inspired multilevel approach for simultaneously detecting multiple independently moving targets from airborne forward-looking infrared (FLIR) sequences is proposed. Owing to the moving platform, low-contrast infrared images, and the nonrepeatability of the target signature, moving targets detection from FLIR sequences is still an open problem. Existing moving targets detection approaches cope with the moving infrared camera by estimating a six-parameter affine or eight-parameter planar projective transformation matrix between two adjacent frames, and this estimation has become the bottleneck for further elevating detection performance. Avoiding such transformation matrix estimation, the proposed approach comprises three sequential modules: motion perception for efficiently extracting motion cues, attended motion views extraction for coarsely localizing moving targets, and appearance perception in the local attended motion views for accurately detecting moving targets. Experimental results demonstrate that the proposed approach is efficient and outperforms the compared state-of-the-art approaches. © 2014 Optical Society of America

OCIS codes: (040.2480) FLIR, forward-looking infrared; (330.4150) Motion detection; (330.4270) Vision system neurophysiology; (100.2960) Image analysis.
http://dx.doi.org/10.1364/JOSAA.31.000734

1. INTRODUCTION
Moving targets detection from airborne image sequences plays an important role in aerial surveillance and is a key technique in the automatic target recognition (ATR) system. As the foundation of high-level tasks, such as target recognition, automatic tracking system initialization, and recovery from tracking failure, moving targets detection from airborne image sequences has attracted considerable research interest. Generally, the existing moving targets detection methods for airborne image sequences can be categorized into two approaches: the visible imagery-based approach [1–9] and the infrared imagery-based approach [9–12]. While a variety of moving targets detection methods have been proposed for visible imagery, there exists a limited amount of work on infrared imagery. As a result of thermal-based imaging, infrared images can provide information that is not available in visible images and can be utilized to detect and track targets day and night. However, compared with visible images, infrared images involve more challenges, such as an extremely low signal-to-noise ratio (SNR), which results in limited information for targets detection and tracking. Moreover, moving targets detection from airborne forward-looking infrared (FLIR) sequences is confronted with numerous difficulties, such as complex ground clutter, the lack of prior constraints on the search scope, strong ego-motion of the infrared sensor, nonrepeatability of the target signature, and artifacts due to weather conditions. As a whole, moving targets detection from airborne FLIR sequences is a vital and challenging task and is the focus of this paper.

With high-speed optical flow techniques [13,14] available, consistent foreground motion clustering based on optical flow may seem an obvious approach to detecting moving targets. Nevertheless, optical flow-based approaches generally face the challenge of the resolvability of targets. Moreover, for moving targets detection in infrared sequences, the optical flow calculation itself may be inaccurate, as infrared images generally contain considerable noise. For these reasons, optical flow generally cannot be directly utilized to implement moving targets detection, especially in infrared sequences. In the state-of-the-art moving targets detection approaches [9–12] for airborne infrared imagery, six-parameter affine or eight-parameter planar projective transformation matrix estimation between two adjacent frames is an indispensable step. A variety of transformation matrix estimation methods [12,15], such as the gradient-based method, the coarse optical flow-based method, and the feature point-based method, have been proposed. However, none of these algorithms can efficiently and robustly work on high-noise and low-contrast infrared images. Given this consideration, improving transformation matrix estimation for poor infrared images appears necessary for elevating moving targets detection performance on airborne infrared imagery. Actually, we can switch to another direction. The human visual system possesses the remarkable ability to pick out salient targets in images, even in the presence of noise, poor weather, and other impediments to perfect vision [16]. Moreover, the human visual system has the same ability to detect moving targets even in poor conditions, where the accurate estimation of the transformation
matrix may be unachievable. The two-streams hypothesis reveals the mechanism of human visual perception [17–20]. Inspired by this visual perception theory, this paper proposes a novel moving targets detection framework for airborne FLIR sequences that avoids transformation matrix estimation.

The two-streams hypothesis is a widely accepted and influential model of the neural processing of vision. Specifically, the visual cortex is organized into two separated functional streams, namely the ventral stream and the dorsal stream. Depending on the vision task, the two-streams hypothesis can be interpreted as the what-versus-where model [17], the form-versus-motion model [18], or the perception-versus-action model [19]. Consistent with the existing two-streams models [17–19], Norman summarized the basic functions of the two separated streams [20]. The dorsal stream, with a high temporal sensitivity and originating from the response of magnocellular cells of the lateral geniculate nucleus (LGN), is mainly responsible for motion perception, localization, and visual guidance, whereas the ventral stream, with a high spatial sensitivity and originating from the response of parvocellular cells of the LGN, is mainly in charge of object perception, such as recognition, identification, and detection. From the perspective of information acquisition, the useful information is separated to provide more efficient signals to both streams [21]. While the existing two-streams models [17–20] mainly build on the independence of the two streams, accumulated evidence demonstrates that this independence has been overemphasized [22]. By collecting data provided by several research groups, Nowak et al. [23] noticed that the conduction and activation speed of the dorsal stream is much faster than that of the ventral stream. They also noted that this difference in conduction and activation latency permits feedforward projections from higher dorsal areas to be sent to the early ventral areas. Since the two streams are in charge of different aspects of visual perception, the modulation from the dorsal stream may naturally reinforce the object perception performance of the ventral stream. Indeed, experiments [24] have shown that the feedforward projection from the dorsal stream can facilitate computation in the ventral stream in a push–pull fashion, which improves neural responses toward moving targets and suppresses background activations.

As far as the moving targets detection task is concerned, the ventral stream can be viewed as a semi-autonomous function that operates under the feedforward modulation informed by dorsal stream processing. Instead of finely simulating the cortical processing of each stream, our approach focuses on imitating the basic function of each visual stream and is intended as a computational model for moving targets detection. Therefore, the proposed moving targets detection framework is implemented by three sequential modules in a multilevel processing style: a motion perception module corresponding to the dorsal stream, an attended motion views extraction module corresponding to the feedforward modulation of the dorsal stream, and an appearance perception module corresponding to the ventral stream. While the image difference based on transformation matrix estimation can be utilized to extract motion cues, it is not the only choice. Actually, in our implementation, the motion cues are extracted by a motion saliency computational model [25,26]. So our
proposed moving targets detection framework can skillfully avoid the transformation matrix estimation problem in poor infrared sequences. Moreover, the separation of appearance and motion perception helps to detect moving targets efficiently, especially when the moving targets are densely distributed. The details of each functional module are introduced in the following.

The rest of this paper is organized as follows. Section 2 presents the existing techniques related to moving targets detection from airborne FLIR sequences. Section 3 details the proposed approach. The testing airborne FLIR sequences, the overall performance of the proposed approach, and a comparison with other state-of-the-art approaches are presented in Section 4. Finally, Section 5 concludes this paper.

2. RELATED WORK
In this section, the existing targets detection approaches for static infrared images and moving targets detection approaches for infrared image sequences are reviewed.

When the thermal imaging system is far from the target of interest, the infrared target presents as a hot spot in the imaging plane. Since the hot spot differs significantly from the background clutter, a variety of methods [27–32] are available to detect it. Because of their important role in the ATR system, many methods [33–37] have focused on targets detection from airborne FLIR imagery. As the airborne platform is relatively close to the target of interest, the resolution of the target becomes high, and the infrared target presents a camouflaged appearance [9,33]. Unfortunately, the camouflaged appearance is easily confused with the ground background clutter. Compared with hot spot detection, detecting the whole camouflaged infrared target is still an intractable problem. Different from the infrared small targets detection approaches [27–32], Yilmaz et al. [33] utilized fuzzy clustering, edge fusion, and local texture energy to detect infrared targets as an initialization of the tracking system. In [34], a spatiotemporal morphological connected operator was utilized to detect infrared targets. Der et al. [35] proposed to form a confidence image of the potential targets using multicue differences between targets and the background; the potential targets could be located by thresholding the confidence image and verified by target-specific knowledge. In [36,37], the infrared targets were detected in two steps, namely candidate targets generation and potential targets confirmation based on an off-line trained target-versus-background model. The infrared targets detection approaches for still images mainly employ the assumption that the target infrared radiation is much stronger than the background and noise. However, this is not always true, so mining motion cues from sequences is a potential approach to increase detection rates and decrease false alarms.

A variety of moving targets detection methods for infrared image sequences [9–12,38,39] have been proposed and play an important role in surveillance security, intelligent traffic, and modern warfare. In [38], a moving targets detection approach based on spatiotemporal information was designed for the video monitoring system. Chen et al. [39] utilized the prior knowledge of the on-board infrared camera to narrow the search scope, which works well in a vehicle-mounted system. Moreover, much work [9–12] has focused on the moving
targets detection from airborne infrared sequences. Without the stationary background premise of the video monitoring system and the prior knowledge of the infrared camera in the vehicle-mounted system, moving targets detection from airborne infrared sequences faces further challenges. Generally, the moving target in airborne infrared sequences may be confused with the ground clutter, its intensity appearance may be camouflaged, and the contrast of infrared images is low, which makes moving targets detection from airborne infrared sequences even harder. The existing state-of-the-art moving targets detection approaches for airborne infrared sequences [9–12] are mainly based on six-parameter affine or eight-parameter planar projective transformation matrix estimation. However, transformation matrix estimation in high-noise and low-contrast infrared sequences is still an open problem and has become the bottleneck for further improving moving targets detection performance. As a whole, moving targets detection from airborne infrared sequences still has tremendous room for improvement.

3. PROPOSED MOVING TARGETS DETECTION FRAMEWORK
The proposed moving targets detection approach for airborne FLIR sequences is called the biologically inspired multilevel moving targets detection framework (BIMLMDF) in the following. As mentioned before, for the moving targets detection task, the ventral stream can be viewed as a semi-autonomous function that operates under visual guidance informed by dorsal stream processing. The dorsal stream is mainly responsible for efficiently extracting motion cues from short-term sequences, while the ventral stream is guided by the motion cues and carries the object perception function. As drawn in Fig. 1, BIMLMDF mainly includes three sequential modules, namely the motion perception module, the attended motion views extraction module, and the appearance perception module, corresponding to the dorsal stream, the feedforward modulation of the dorsal stream, and the ventral stream, respectively. Specifically, the motion perception module based on phase discrepancy focuses on efficiently extracting motion cues, the attended motion views extraction module focuses on coarsely localizing potential moving targets, and the appearance perception module focuses on accurately detecting moving targets. The three sequential modules of BIMLMDF work as a whole to control gazes so as to minimize the difficulty of moving targets detection.

Fig. 1. Moving object detection framework inspired by the two-streams hypothesis. From left to right are (a) the original image sequences, (b) the attended motion views based on the motion perception result, and (c) the moving object detection results in the local attended motion views.

The details of each module are introduced in the following parts.
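Viewed as code, the three-module pipeline is a simple chain. The following Python sketch is purely illustrative: the three function names are placeholders for the modules detailed below, not part of the original paper.

```python
def detect_moving_targets(frames, t, delta=3):
    # Dorsal stream: extract motion cues from short-term sequences (module A).
    pmsal = motion_perception(frames, t, delta)
    # Feedforward modulation: coarse ROIs around potential targets (module B).
    views = extract_attended_motion_views(pmsal)
    # Ventral stream: accurate detection inside each attended view (module C).
    return [detect_in_view(frames[t], view) for view in views]
```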

A. Motion Perception
The first module of BIMLMDF, motion perception, serves as an initialization step that extracts motion cues from short-term image sequences. While the interframe difference with motion compensation could obviously be taken as the motion perception result, the aforementioned transformation matrix estimation problem in infrared images makes the interframe difference not the best choice for efficiently perceiving motion. Moreover, from the separation perspective of our proposed BIMLMDF, motion perception only plays the role of providing motion cues for further targets detection, which is the duty of the other modules. With the development of biological vision, many researchers have shifted their attention to motion saliency detection, and a variety of motion saliency computational models [25,26] have been proposed. Owing to its calculation efficiency, the phase discrepancy-based motion saliency detection method [26] has attracted wide attention [40–42], so it is utilized here to calculate the motion perception result. Taking into account the low contrast of infrared images and the possibly rapid illumination change caused by the appearance of hot or cold targets, the motion perception result from infrared sequences is calculated by the following steps.

First, normalize the input image I(x, y, t) with Eq. (1) proposed in [9] to improve the image contrast and balance the illumination change:

$$ I(x, y, t) = \frac{I(x, y, t) - \bar{I}(\cdot, t)}{\operatorname{std}(I(\cdot, t))}, \tag{1} $$

where $\bar{I}(\cdot, t)$ and $\operatorname{std}(I(\cdot, t))$ denote the mean value and standard deviation of the image, respectively. Though the pixel value may be negative after the normalization mapping by Eq. (1), this does not affect the motion saliency calculation based on phase discrepancy.

Second, the motion saliency $\mathrm{MSal}(x, y; t, t+\Delta)$ is calculated from two adjacent frames I(x, y, t) and I(x, y, t + Δ) using phase discrepancy [26]:

$$ \mathrm{MSal}(x, y; t, t+\Delta) = \left\{ \mathcal{F}^{-1}\!\left[ \left( |F_{t+\Delta}(\omega)| - |F_{t}(\omega)| \right) \cdot e^{-i \angle F_{t+\Delta}(\omega)} \right] \right\}^{2}, \tag{2} $$


where $\mathcal{F}$ represents the Fourier transform operator, Δ denotes the frame difference distance, and $F_t(\omega)$ stands for the Fourier transform of frame I(x, y, t). $|F_t(\omega)|$ and $\angle F_t(\omega)$ denote the amplitude and phase spectrum, respectively. Finally, the motion perception result can be expressed by Eq. (3), which suppresses the estimation error caused by complex textured background. In our implementation, for calculation efficiency, the motion perception result is calculated by Eq. (4):


$$ \mathrm{PMSal}(x, y, t) = \min\!\left( \frac{1}{\Delta} \int_{1}^{\Delta} \mathrm{MSal}(x, y; t, t-k)\, dk,\; \frac{1}{\Delta} \int_{1}^{\Delta} \mathrm{MSal}(x, y; t, t+k)\, dk \right), \tag{3} $$

$$ \mathrm{PMSal}(x, y, t) = \min\!\left( \mathrm{MSal}(x, y; t, t-\Delta),\; \mathrm{MSal}(x, y; t, t+\Delta) \right). \tag{4} $$

The motion perception result extracted from short-term sequences is drawn in Fig. 3(b).
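As a concrete illustration, the motion perception module can be sketched with NumPy as follows. This is a minimal reading of Eqs. (1), (2), and (4), not the reference implementation of [26]; in particular, taking the real part of the inverse transform before squaring is our assumption.

```python
import numpy as np

def normalize(frame):
    # Eq. (1): zero-mean, unit-standard-deviation normalization to improve
    # contrast and balance illumination change between frames.
    return (frame - frame.mean()) / frame.std()

def motion_saliency(frame_t, frame_s):
    # Eq. (2): phase discrepancy between frame t and frame s (s = t +/- Delta).
    F_t = np.fft.fft2(frame_t)
    F_s = np.fft.fft2(frame_s)
    amp_diff = np.abs(F_s) - np.abs(F_t)
    recon = np.fft.ifft2(amp_diff * np.exp(-1j * np.angle(F_s)))
    return np.real(recon) ** 2          # assumption: real part, then square

def motion_perception(frames, t, delta=3):
    # Eq. (4): pixel-wise minimum of the backward and forward saliency maps,
    # which suppresses errors caused by complex textured background.
    cur = normalize(frames[t])
    back = motion_saliency(cur, normalize(frames[t - delta]))
    fwd = motion_saliency(cur, normalize(frames[t + delta]))
    return np.minimum(back, fwd)
```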

B. Attended Motion Views Extraction
The second module of BIMLMDF, extraction of the attended motion views, indicates the window places of moving targets using the motion perception result. As depicted in Fig. 1, the attended motion views extraction module serves as a critical link between motion perception and appearance perception and takes charge of generating regions of interest (ROIs) that probably contain the potential moving targets. The attended motion focus and view are successively extracted from the motion perception result to locate the moving targets, similar to the visual saliency process in [43]. As the attended motion view is calculated based on the attended motion focus, we first introduce how to extract the attended motion focus from the motion perception result.

Utilizing the inhibition-of-return strategy [44], we can recursively detect the local maximum of the motion perception result as the attended motion focus. However, the inhibition-of-return strategy does not have a recursion terminal condition, which makes the automation of ROIs extraction impossible. Concerning this issue, the number of attended motion focuses is determined by the following steps. Given the upper-bound number M, we obtain the attended focus set F = {(x_i, y_i), i = 1, ..., M} and output the motion perception residual map RPMSal(x, y, t) using the inhibition-of-return strategy in [44]. Obviously, this step leads to some possible false attended motion focuses in the set F if we are to ensure that all the true attended motion focuses are extracted.

For the selection of the upper-bound number M, fixing M to a constant value may seem an obvious approach. However, the optimal upper-bound number M depends on the testing scenario, which is verified in the experimental section. Stimulated by that, an adaptive upper-bound number selection approach is introduced via the local maximum analysis of the motion perception result PMSal(x, y, t). Given the motion perception result, depicted in Fig. 2(a), the local maximum map, depicted in Fig. 2(b), can be calculated by three simple operations: 1, Gaussian smoothing of the motion perception result; 2, dot product of the smoothed motion perception result with itself; 3, local maxima within a 3 × 3 neighborhood are kept and all other pixels are set to 0. Furthermore, the following steps are utilized to determine the upper-bound number M. First, we normalize the nonzero local maxima in the local maximum map into [1,300]. Then, the local maximum histogram with thirty equal bins is formed without normalization, as depicted in Fig. 2(c). Finally, the upper-bound number M is given by the sum of the local maximum occurrence counts under an occurrence count threshold histTh, as depicted in Fig. 2(c).
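A minimal sketch of this adaptive selection, assuming NumPy/SciPy; the Gaussian sigma and the reading of the "dot product" step as element-wise squaring are our assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def adaptive_upper_bound(pmsal, hist_th=10):
    # 1. Gaussian smoothing of the motion perception result (sigma assumed).
    smoothed = gaussian_filter(pmsal, sigma=1.0)
    # 2. "Dot product" of the smoothed result, read as element-wise squaring.
    squared = smoothed * smoothed
    # 3. Keep local maxima within a 3x3 neighborhood; set other pixels to 0.
    peaks = np.where(maximum_filter(squared, size=3) == squared, squared, 0.0)
    # Normalize the nonzero local maxima into [1, 300].
    vals = peaks[peaks > 0]
    vals = 1.0 + 299.0 * (vals - vals.min()) / (vals.max() - vals.min() + 1e-12)
    # Thirty-bin histogram; M is the total count falling in bins whose
    # occurrence count is below histTh (scattered values suggest targets).
    counts, _ = np.histogram(vals, bins=30, range=(1.0, 300.0))
    return int(counts[counts < hist_th].sum())
```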

Fig. 2. Adaptive upper-bound selection illustration. (a) The motion perception result, (b) the local maximum map of the motion perception result, and (c) the quantization histogram of the local maximum map.


Fig. 3. Intermediate results of our proposed BIMLMDF. The first row denotes the moving targets detection situation in Sequence 1, the second row denotes the moving targets detection situation in Sequence 2, the third row denotes the moving targets detection situation in Sequence 3, and the fourth row denotes the moving targets detection situation in Sequence 4. From left to right are (a) the original image, (b) the motion perception result from short-term image sequences, (c) the initial attended motion focus set, (d) the refined attended motion focus set, (e) the attended motion view set, (f) the mask of the moving targets, (g) the saliency map of moving targets in the mask, and (h) the moving targets detection result.

The rationale behind the proposed adaptive upper-bound number selection approach is that the local maxima corresponding to the background tend to distribute consistently, whereas the local maxima corresponding to the potential moving targets generally scatter. So the sum of the local maximum occurrence counts under histTh can be utilized to estimate the upper-bound number M. The validation of the proposed adaptive upper-bound number selection approach is detailed in the experimental section.

As depicted in Fig. 3(c), false attended motion focuses indeed exist, so it is critical to exclude the false attended motion focuses from the original attended motion focus set F. An attended motion focus f ∈ F is a true one if and only if f satisfies two constraints: the distribution constraint given by Eq. (5) and the local contrast constraint given by Eq. (6):

$$ \frac{\sum_{(x,y)\in\Omega_1(f)} \mathrm{PMSal}(x, y, t)}{R(\Omega_1(f))} \ge u_b + m \times \sigma_b, \tag{5} $$

$$ \frac{\sum_{(x,y)\in\Omega_1(f)} \mathrm{PMSal}(x, y, t)}{R(\Omega_1(f))} \ge n \times \frac{\sum_{(x,y)\in\Omega_2(f)\setminus\Omega_1(f)} \mathrm{PMSal}(x, y, t)}{R(\Omega_2(f)) - R(\Omega_1(f))}, \tag{6} $$

where u_b and σ_b denote the mean value and standard deviation of the motion perception residual map RPMSal(x, y, t). Ω₁(f) and Ω₂(f) represent local neighborhoods of the attended focus point f, and Ω₁(f) ⊂ Ω₂(f) holds. R(·) denotes the neighborhood metric and can be simplified to the number of pixels in the neighborhood. In our implementation, m and n are set to 3 and 1.2, respectively. After the exclusion stage, we obtain the refined attended motion focus set RF, which satisfies the constraints and acts as the final attended motion focus set. As shown in Fig. 3(d), the refined attended motion focuses act as the index to locate moving targets.
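The refinement stage can be sketched as follows, assuming square neighborhoods, pixel counts for the neighborhood metric R(·), and omitting image-boundary handling.

```python
import numpy as np

def refine_focuses(pmsal, residual, focuses, r1=3, r2=5, m=3.0, n=1.2):
    # r1, r2: half-widths of the 7x7 (Omega_1) and 11x11 (Omega_2) neighborhoods.
    u_b, sigma_b = residual.mean(), residual.std()
    refined = []
    for x, y in focuses:
        inner = pmsal[y - r1:y + r1 + 1, x - r1:x + r1 + 1]
        outer = pmsal[y - r2:y + r2 + 1, x - r2:x + r2 + 1]
        mean_inner = inner.mean()
        mean_ring = (outer.sum() - inner.sum()) / (outer.size - inner.size)
        # Eq. (5): distribution constraint against the residual-map statistics.
        # Eq. (6): local contrast constraint against the surrounding ring.
        if mean_inner >= u_b + m * sigma_b and mean_inner >= n * mean_ring:
            refined.append((x, y))
    return refined
```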

attended motion focus, where (x₀, y₀) represents the coordinate of the attended motion focus and (w₀, h₀) represents the width and height of the window centered at (x₀, y₀). The attended motion view's width and height are computed with Eqs. (7) and (8):

$$ w_0 = \alpha \cdot \frac{\sum_{(x,y)\in\Omega_3(f_0)} \mathrm{PMSal}(x, y) \cdot |x - x_0|}{\sum_{(x,y)\in\Omega_3(f_0)} \mathrm{PMSal}(x, y)}, \tag{7} $$

$$ h_0 = \alpha \cdot \frac{\sum_{(x,y)\in\Omega_3(f_0)} \mathrm{PMSal}(x, y) \cdot |y - y_0|}{\sum_{(x,y)\in\Omega_3(f_0)} \mathrm{PMSal}(x, y)}, \tag{8} $$

where f₀ = (x₀, y₀) ∈ RF denotes the attended motion focus, Ω₃(f₀) represents the local neighborhood centered at the attended motion focus f₀, and Ω₃(f₀) ⊃ Ω₂(f₀) ⊃ Ω₁(f₀) holds. α ≥ 1 is a constant coefficient and is set to 2 in our implementation. As depicted in Fig. 3(e), the attended motion views can entirely cover the potential moving targets. The process of attended motion view extraction can be viewed as the last stage of the motion perception and the first stage of the appearance perception, corresponding to the dorsal stream and ventral stream, respectively.
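A sketch of the view computation under the same assumptions; reading the garbled Eqs. (7) and (8) as saliency-weighted mean absolute offsets |x − x₀| and |y − y₀| is our reconstruction.

```python
import numpy as np

def attended_view(pmsal, focus, r3=7, alpha=2.0):
    # Omega_3: 15x15 neighborhood (half-width r3 = 7) around the focus;
    # boundary handling is omitted for brevity.
    x0, y0 = focus
    patch = pmsal[y0 - r3:y0 + r3 + 1, x0 - r3:x0 + r3 + 1]
    ys, xs = np.mgrid[y0 - r3:y0 + r3 + 1, x0 - r3:x0 + r3 + 1]
    total = patch.sum() + 1e-12
    # Eqs. (7)/(8): alpha times the saliency-weighted mean absolute offset.
    w0 = alpha * (patch * np.abs(xs - x0)).sum() / total
    h0 = alpha * (patch * np.abs(ys - y0)).sum() / total
    return x0, y0, w0, h0
```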

C. Appearance Perception
The third module of BIMLMDF, appearance perception, copes with the challenges that the intensity appearance of the moving target may be camouflaged and that the contrast between targets and background is extremely low, as depicted in Fig. 3(f). In the appearance perception module, a gradient-based method is proposed and utilized to accurately detect moving targets in the local attended motion views. To be clear, all the following operations are grounded on the mask, which is the union of the attended motion views.

In order to enhance the contrast between targets and background, we first calculate the gradient magnitude of the original image using Eq. (9):

$$ \mathrm{GM}(x, y) = \sqrt{I_x^2 + I_y^2}. \tag{9} $$

The targets in the local attended motion views are then enhanced by the difference-of-Gaussians operator in Eq. (10):

$$ \mathrm{DM}(x, y) = \sum_{\sigma} \mathrm{GM}(x, y) \otimes \left[ G(x, y; k\sigma) - G(x, y; \sigma) \right], \tag{10} $$

where G(·) represents the Gaussian convolution kernel, σ ∈ {1.6, 2.4, 3.2} stands for the scale factor, and k = 1.5 is a constant. Utilizing Eq. (10), the targets in the mask can be entirely enhanced, but several genuine targets may be merged into one when they are densely distributed. In order to tackle this problem, we use Eq. (11) as the final saliency map of moving targets in the mask:

$$ \mathrm{SM}(x, y) = \mathrm{Diff}\!\left( \lambda \cdot \mathrm{GM}(x, y) + (1 - \lambda) \cdot \mathrm{DM}(x, y) \right), \tag{11} $$

where λ ∈ (0, 1) is a constraint constant and Diff(·) stands for the anisotropic diffusion operator proposed in [45], which smooths the background and keeps the targets isolated. A large number of experiments show that the threshold determination method of [46] works well on the saliency map SM(x, y) in the mask. The moving targets can be accurately segmented and located from the saliency map SM(x, y) with the threshold Th = μ + σ, where μ and σ represent the mean value and standard deviation of the saliency map, respectively. Based on that, the final moving targets detection result is drawn in Fig. 3(h).
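A compact sketch of the appearance perception module. Plain Gaussian smoothing stands in for the anisotropic diffusion operator of [45], and λ = 0.5 is an assumed value, so this illustrates the structure of Eqs. (9)-(11) rather than reproducing the paper's exact operators; the restriction to the attended-view mask is also omitted for brevity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def appearance_perception(image, lam=0.5, k=1.5, sigmas=(1.6, 2.4, 3.2)):
    # Eq. (9): gradient magnitude enhances target/background contrast.
    gm = np.hypot(sobel(image, axis=0), sobel(image, axis=1))
    # Eq. (10): multi-scale difference-of-Gaussians enhancement of GM.
    dm = sum(gaussian_filter(gm, k * s) - gaussian_filter(gm, s) for s in sigmas)
    # Eq. (11): blend GM and DM; Gaussian smoothing stands in for the
    # anisotropic diffusion operator Diff(.) of [45].
    sm = gaussian_filter(lam * gm + (1.0 - lam) * dm, sigma=1.0)
    # Segment with the adaptive threshold Th = mu + sigma [46].
    return sm >= sm.mean() + sm.std()
```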

4. EXPERIMENTAL RESULTS
In this section, the testing airborne FLIR sequences and evaluation metrics are introduced in Subsection 4.A, the overall performance of our proposed BIMLMDF is depicted in Subsection 4.B, and the comparison with other existing state-of-the-art approaches is presented in Subsection 4.C.

A. Dataset and Evaluation Metrics
The FLIR sequences were shot from an airborne platform. The details of the data generation are as follows. The operating wavelength interval of the infrared camera lies in the long-wave infrared range, 8–14 μm. With the field of view set to 4° × 3°, infrared image sequences with 320 × 256 pixel resolution are recorded at 25 frames/second. Four sequences are selected as the testing dataset. Typical frames from Sequence 1, Sequence 2, Sequence 3, and Sequence 4 are shown in the first, second, third, and fourth rows of Fig. 3, respectively. The characteristics of the testing infrared sequences are summarized in Table 1.


The four testing sequences cover different moving targets detection situations from airborne FLIR sequences, so the evaluation on the four testing sequences can fairly demonstrate the validity and performance of a moving targets detection method for airborne FLIR sequences. The key frames are sequentially selected from the original testing sequences with a fixed interval l = 10. The moving targets in the key frames are manually labeled with their corresponding compact bounding rectangles, shown in Fig. 6(b), as the ground truth. As the size of the targets to be detected is relatively small, the bounding rectangles approximate the targets to a reasonable degree. Let H denote the hit area (the overlapping regions between the detection result and the ground truth), i.e., the area belonging to the target and accurately detected; M denote the miss area, i.e., the area belonging to the target but incorrectly missed; and F denote the false alarm area, i.e., the area not belonging to the target but incorrectly detected. The hit rate (HR) and false alarm rate (FAR) are defined by

$$ \mathrm{HR} = \frac{H}{H + M}, \qquad \mathrm{FAR} = \frac{F}{H + F}. \tag{12} $$

A perfect moving targets detection result would make the HR equal to 1 and the FAR equal to 0.
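Given boolean masks for the detection result and the rectangle ground truth, the metrics of Eq. (12) reduce to a few lines of NumPy; a minimal sketch.

```python
import numpy as np

def hit_and_false_alarm_rate(detected, truth):
    # detected, truth: boolean masks of the detection result and ground truth.
    h = np.logical_and(detected, truth).sum()    # hit area
    m = np.logical_and(~detected, truth).sum()   # miss area
    f = np.logical_and(detected, ~truth).sum()   # false alarm area
    hr = h / (h + m) if (h + m) > 0 else 0.0     # Eq. (12), hit rate
    far = f / (h + f) if (h + f) > 0 else 0.0    # Eq. (12), false alarm rate
    return hr, far
```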

Table 1. Four Testing Airborne FLIR Sequences

Sequence ID   Original Frames   Key Frames   SNR of Targets    Distribution of Targets   Platform Moving Speed
Sequence 1    517               51           Relatively High   Dense                     Relatively Low
Sequence 2    546               54           Relatively High   Dense                     High
Sequence 3    674               67           Low               Sparse                    High
Sequence 4    896               89           Low               Sparse                    High


Fig. 4. Detection performance under different upper-bound number M. From left to right are (a) the detection performance change curve on Sequence 1, (b) the detection performance change curve on Sequence 2, (c) the detection performance change curve on Sequence 3, and (d) the detection performance change curve on Sequence 4.

B. Overall Performance of Our Proposed BIMLMDF
In our implementation, we set the distance of two frames in the temporal dimension to Δ = 3. The local neighborhoods Ω₁(·), Ω₂(·), and Ω₃(·) are 7 × 7, 11 × 11, and 15 × 15 rectangular neighborhoods, respectively. Apparently, the upper-bound number of the initial attended motion focus set may be related to the number of moving targets appearing in the current frame, so fixing the upper-bound number M for every scenario may not be rational, which is discussed and verified in the following. Here, the upper-bound number M is adaptively determined based on the local maximum analysis of the motion perception result, where histTh is set to 10; how to determine histTh is introduced below. Based on this configuration, the intermediate results of each aforementioned module are depicted in Fig. 3.

As mentioned above, the attended motion views extraction serves as a critical link between motion perception and appearance perception, so the upper-bound number M of the original attended motion focus set may have some impact on the overall performance of our proposed BIMLMDF. In order to analyze the sensitivity of the detection performance to the upper-bound number M, the HR and FAR are counted under different values of M. Based on the evaluation metrics introduced before, an optimal upper-bound number of the initial attended motion focus set should make the algorithm obtain a high HR and a low FAR. In this paper, the optimal parameter is the one that yields the lowest FAR under the premise that the highest HR has been achieved. Based on the quantitative results depicted in Figs. 4(a)–4(d), the optimal upper-bound number on Sequence 1, Sequence 2, Sequence 3, and Sequence 4 is 20, 10, 15, and 10, respectively. From the quantitative results depicted in Fig. 4, we can see that the optimal upper-bound number varies across scenarios, so fixing the upper-bound number to a constant value lacks generality, and it is necessary to adaptively determine the upper-bound number M.

As mentioned before, the upper-bound number M can be calculated based on the local maximum analysis of the motion perception result, where histTh is a threshold to be determined. Instead of explicitly giving the upper-bound number M, the quantitative evaluation results under different histTh are summarized in Fig. 5. Based on the aforementioned criterion of optimal parameter selection, the optimal value range of histTh on Sequence 1, Sequence 2, Sequence 3, and Sequence 4 is the interval [8,16], [6,12], [8,16], and [6,16], respectively. Furthermore, the overall optimal value range of histTh is the interval [8,12] (i.e., the intersection of [8,16], [6,12], [8,16], and [6,16]), which makes BIMLMDF work on all four testing sequences at the optimal status. While the proposed adaptive upper-bound number selection approach still has a parameter histTh to be determined, any histTh in the interval [8,12] makes BIMLMDF work on the testing datasets at the optimal status (in our implementation, histTh is set to 10 for better generality). Therefore, the final detection performance is less sensitive to the selection of histTh than to the direct selection of the upper bound. From the comparison of Figs. 4 and 5, BIMLMDF with the adaptive

upper-bound number selection approach embedded can achieve a lower FAR at the same HR on each corresponding sequence, compared with hard-fixing the upper-bound number. As a whole, the proposed adaptive upper-bound number selection approach is valid and suitable to be embedded into BIMLMDF.

C. Comparison Result with Other Existing Methods
As mentioned before, the existing moving targets detection approaches in [9–12] can potentially be utilized to implement moving targets detection from airborne FLIR sequences. As an extension of the motion history image (MHI), the moving targets detection approach based on the forward and backward MHI in [9] was designed to facilitate moving targets detection from an airborne infrared camera; for clarity, the infrared moving targets detection approach in [9] is called MHI in the following. Benefiting from the MHIs, the MHI in [9] can accurately locate moving targets and cope with slow motion detection, so it is reimplemented for comparison with our proposed BIMLMDF. Considering that the approach in [10] is relatively old, that the approach in [11] is specifically proposed for moving targets detection from airborne infrared imagery, and that the approach in [12] is mostly a review of infrared moving object detection and tracking, the algorithm in [11] is reimplemented and selected for comparison with our proposed BIMLMDF. For clarity, the real-time multiple moving targets detection approach based on the dynamic Gabor filter and dynamic Gaussian in [11] is called RTDD in the following. In order to demonstrate the validity of our proposed BIMLMDF, the MHI in [9] and RTDD in [11] are also evaluated on our infrared testing dataset for comparison. For a fair comparison, in all algorithms, the step size of the image difference is set to 3. The other parameters are set as

Fig. 5. Detection performance under different histTh. From left to right are (a) the detection performance change curve on Sequence 1, (b) the detection performance change curve on Sequence 2, (c) the detection performance change curve on Sequence 3, and (d) the detection performance change curve on Sequence 4.


Fig. 6. Comparison result with other state-of-the-art moving targets detection methods in the airborne forward-looking infrared (FLIR) sequences. The first row denotes the moving targets detection result in Sequence 1, the second row denotes the moving targets detection result in Sequence 2, the third row denotes the moving targets detection result in Sequence 3, and the fourth row denotes the moving targets detection result in Sequence 4. From left to right are (a) the original image, (b) the ground truth, (c) the moving targets detection result by [9], (d) the moving targets detection result by [11], and (e) the moving targets detection result by our proposed BIMLMDF.

the default configuration of the corresponding algorithm. The bounding rectangles of the moving targets mask output by the MHI in [9] and the specular highlights region output by RTDD in [11] are taken as the final moving targets detection results. The moving targets detection results output by the different algorithms are shown in Fig. 6. As the four testing sequences represent different moving targets situations from airborne FLIR sequences, the quantitative results on the different sequences are summarized separately in Figs. 7(a)–7(d). As shown in Fig. 7(a), the MHI proposed in [9] works well on Sequence 1, mainly because Sequence 1 was shot when the airborne platform moved at a relatively

low speed and the displacement of the image background is small. However, the MHI in [9] is unable to perform well on Sequence 2, Sequence 3, and Sequence 4, as depicted in Figs. 7(b)–7(d). As the transformation matrix estimation, especially in high-noise and low-contrast infrared images, is an open problem, the forward and backward history images utilized in [9] may accumulate the ego-motion error when there is a large displacement of the image background. RTDD in [11] obtains an overall low HR, as depicted in Fig. 7, because the final moving targets detection result is represented by the specular highlights region generated by RTDD in [11]. Actually, the specular highlights region can

Fig. 7. Comparison with two state-of-the-art moving targets detection methods in the airborne FLIR sequences. From left to right are (a) comparison result on Sequence 1, (b) comparison result on Sequence 2, (c) comparison result on Sequence 3, and (d) comparison result on Sequence 4.


work well for localizing the moving targets, but it is inefficient for detecting whole moving targets with clear borders. As shown in Fig. 6(e), our proposed BIMLMDF can not only accurately localize the moving targets but also detect them entirely, even when the moving targets are densely distributed. Moreover, as depicted in Fig. 7, the quantitative results demonstrate that our proposed BIMLMDF works well in different moving targets detection situations of an airborne platform and outperforms the two existing state-of-the-art moving targets detection approaches on the airborne FLIR sequences. In order to demonstrate the robustness of our proposed BIMLMDF, more visual results in different situations (such as the low SNR situation, dense distribution of targets, and varying distance between the airborne imaging system and the target of interest) are illustrated in Fig. 8.

The two existing approaches [9,11] and our proposed BIMLMDF are all implemented on a PC with a 3 GHz CPU and 4 GB RAM. The aforementioned airborne FLIR sequences with 320 × 256 pixel resolution are utilized to test the average running time of the three algorithms. For a fair comparison, all three algorithms are tested at the original 320 × 256 resolution without down-sampling. The average running times of the three algorithms are summarized in Table 2. As depicted in Table 2, our proposed BIMLMDF can meet the requirements of real-time moving targets detection and automatic tracking systems. In the architecture of our proposed BIMLMDF, most of the running time is spent in the motion perception module, which needs a global search. From the separation perspective of appearance and motion perception, without a large penalty

Fig. 9. Intermediate results of our proposed BIMLMDF. Corresponding to Fig. 3, the first row, second row, third row, and fourth row denote the moving targets detection situation in Sequence 1, Sequence 2, Sequence 3, and Sequence 4, respectively. From left to right are (a) the original image, (b) the motion perception result from short-term image sequences, (c) the initial attended motion focus set, (d) the refined attended motion focus set and attended motion view set, and (e) the moving targets detection result.

on the detection performance, the running time of our proposed BIMLMDF can be further reduced by extracting motion cues at a lower resolution and utilizing the original resolution in the other modules. In order to verify this judgment, our proposed BIMLMDF is also configured as follows: the motion perception module is implemented on the original images down-sampled by a factor of two (i.e., 160 × 128 resolution), and the other modules are implemented on the original images (i.e., 320 × 256 resolution). The intermediate results of BIMLMDF under this configuration are shown in Fig. 9. From Fig. 9, we can see that the motion perception results can still yield a relatively stable attended motion view set. Based on this configuration, the average running time of our proposed BIMLMDF is further improved.

Fig. 8. More visual results of our proposed BIMLMDF. The first row denotes the case in which the moving targets to be detected have a relatively low SNR. The second row denotes the case in which the moving targets are densely distributed. The third row denotes the case in which the airborne platform gradually approaches the targets to be detected.

Table 2. Running Time of Different Algorithms (320 × 256)

                      MHI in [9]   RTDD in [11]   Our Proposed BIMLMDF
Frames/second (f/s)   -            -              -
