
Geodesic Invariant Feature: A Local Descriptor in Depth

Yazhou Liu, Pongsak Lasang, Member, IEEE, Mel Siegel, Fellow, IEEE, and Quansen Sun

Abstract— Different from photometric images, depth images resolve the distance ambiguity of the scene, while properties such as weak texture, high noise, and low resolution may limit the representation ability of the well-developed descriptors that were elaborately designed for photometric images. In this paper, a novel depth descriptor, geodesic invariant feature (GIF), is presented for representing the parts of articulate objects in depth images. GIF is a multilevel feature representation framework proposed based on the nature of depth images. At the low level, the geodesic gradient is introduced to obtain invariance to articulate motion, such as scale and rotation variation. At the mid level, superpixel clustering is applied to reduce depth image redundancy, resulting in faster processing and better robustness to noise. At the high level, a deep network is used to exploit the nonlinearity of the data, which further improves the classification accuracy. The proposed descriptor is capable of encoding the local structures in the depth data effectively and efficiently. Comparisons with the state-of-the-art methods reveal the superiority of the proposed method.

Index Terms— Body parts recognition, pose recognition, depth image, deep learning, and superpixel.

I. INTRODUCTION

Photometric local descriptor [1] is one of the most powerful tools for image and video analysis. It has attracted extensive research efforts in recent years, and remarkable progress has been achieved [2]–[7]. A local descriptor encodes the micro-structure or statistical information of a region. Ideally, this description should be at least partially invariant to photometric (changes in brightness, contrast, saturation, or color balance) and geometric (mainly affine transformations, like translations, rotations, and scale changes) transforms [8]. Therefore, it has a wide variety of applications

Manuscript received June 21, 2014; revised October 13, 2014; accepted November 26, 2014. Date of publication December 4, 2014; date of current version December 16, 2014. This work was supported in part by the Program of Introducing Talents of Discipline to Universities under Grant B13022, in part by the Doctoral Fund through the Ministry of Education, China, under Grant 20133219120033, in part by the Open Project Program through the Jiangsu Key Laboratory of Image and Video Understanding for Social Safety under Grant JSKL201306, and in part by the National Natural Science Foundation of China under Grant 61273251, Grant 61300161, and Grant 61371168. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Dimitrios Tzovaras. Y. Liu and Q. Sun are with the Department of Computer Science and Engineering, Nanjing Institute of Science and Technology, Nanjing 210048, China (e-mail: [email protected]; [email protected]). P. Lasang is with the Panasonic Research and Development Center Singapore, Singapore 639798 (e-mail: [email protected]). M. Siegel is with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2014.2378019

in the fields of object detection and classification [9], [10], texture analysis [11]–[13], image retrieval [14], [15], object tracking [16], [17], and face recognition [18], [19].

Recently, with the rapid development of range sensors, such as Kinect and SwissRanger, 3D depth information can be readily obtained at low cost. These devices use either structured light or time-of-flight (ToF) to measure the distance between objects and the camera. The images captured by these devices are known as range/depth images. Since a depth image resolves the distance ambiguities that exist in a photometric image, a large number of recent applications have emerged based on it, for instance, 3D scanning and reconstruction [20], [21] and pose and action recognition [22]–[25]. Comparing depth images with photometric images, the differences lie in the following aspects.

1) Geometrical structure. Pixels in a depth image indicate calibrated depth in the scene, rather than a measure of intensity or color [26]. Therefore, depth images capture the geometrical structure information and resolve the depth ambiguities, which can greatly simplify some processing steps such as background subtraction.

2) Weak texture. In the depth image, the color and texture variations induced by clothing, hair, and skin are not observable.

3) High noise. Compared with advanced photometric image sensors, the noise rates of depth sensors are relatively high, especially in environments with strong ambient light.

4) Low resolution. The lateral resolution of time-of-flight cameras is generally low compared to standard 2D video cameras, with most commercially available devices at 320 × 240 pixels or less. Kinect claims a lateral resolution of 640 × 480.

With these essential differences, the well-developed local descriptors for photometric images may not be readily applied to depth images. For example, the scale invariant feature transform (SIFT) [2] descriptor and its variants [4], [9] encode the gradient distribution with respect to the orientations within a local region. In depth images, the edges and gradients are not as rich and distinct as in photometric images; therefore, the corresponding gradient-based descriptors may not be as informative either. The local binary pattern (LBP) descriptor [27] and its variants [11] represent the statistics of micro-structures of a region. For depth images, the meaningful structures mainly exist at the boundary regions of objects; therefore, different parts within the objects cannot be differentiated effectively.


Finding a descriptor that can encode the local information of depth images effectively is the motivation and target of this paper. In order to achieve this target, the properties of depth images and their targeted applications must be considered during the feature design process. The desirable properties of depth descriptors are summarized as follows. Firstly, descriptors in depth should have some invariance properties that are preferable for the specific applications. Secondly, because depth images are noisy and of weak texture, the units being processed should be changed from pixels to some mid-level representation, which can reduce the processing time and improve the robustness of the descriptor. Thirdly, in order to capture the nonlinear nature of the feature, some high-level representation should be discovered to improve the discriminative power of the feature. The contribution of this paper lies in the multi-level feature extraction hierarchy for depth images. The above targeted properties are addressed at the different levels of the hierarchy.

Low-level representation encodes the invariance to articulate motion. An active research topic based on depth images is pose and action recognition [22]–[24]. In this context, most of the objects of interest are non-rigid and consist of multiple parts; for instance, human/animal bodies contain a torso and limbs, and human hands contain fingers. Parts of the objects are connected by joints and have multiple degrees of freedom (DoFs). The overall motion patterns of these objects are articulate motions. It is desirable that the descriptors can provide consistent representations of the parts in different poses and gestures. The object is modelled as a graph whose nodes correspond to the pixels on the object and whose edges represent the neighborhood relationships between the pixels. Based on this graph model, the geodesic gradient is introduced to rectify the feature extraction, which is then invariant to the articulate motion as long as the local connection relationship does not change during the motion.

Mid-level representation reduces the computation cost. By unsupervised clustering, the pixels of the depth image are grouped into clusters according to their 3D positions in the scene. The clustering result is referred to as the superpixel representation, which is used to replace the rigid structure of the pixel grid. On the one hand, this representation provides a convenient primitive from which to compute image features, which greatly reduces the complexity of subsequent image processing tasks [28]. Especially for depth images with weak texture, in which only the pixels on the boundaries of objects carry distinct structural information, the superpixel-based representation reduces the image redundancy dramatically. On the other hand, since the noise level of depth sensors is higher than that of photometric sensors, superpixels can suppress the noise effect of individual pixels and yield a more robust representation.

High-level representation captures the nonlinear information. Because of the complexity of the data distribution, a nonlinear mapping that maps the data from their original space to some latent feature space is used to achieve better discrimination. Deep learning [29]–[32] is utilized to find high-order dependencies of the feature, which has been successfully


Fig. 1. The processing flow of the proposed method (Best viewed in color). (a) The input depth image. (b) Floor and ceiling detection. (c) The foreground segmentation result. (d) Geodesic distance map of the foreground object. (e) Low-level feature: geodesic invariant feature extraction: (e1) is the isoline map and (e2) is the geodesic gradient map. (f) Mid-level feature: superpixel constrained geodesic invariant feature extraction: (f1) is the superpixel segmentation result and (f2) is the geodesic gradient map of the superpixels. (g) Orientation regularized binary feature calculation. (h) Binary feature strings. (i) High-level feature: deep network for feature mining.

applied in many fields including computer vision [33]–[35], natural language processing, and speech recognition [36]–[38]. In this work, the employed deep network is based on stacked denoising autoencoders (SdA). As the feature evolves towards the deeper layers of the SdA, two desirable properties for classification have been observed: sparsity and better discrimination. The overall processing flow of the method is shown in Fig. 1.

The rest of the paper is structured as follows. Section II provides a brief summary of the related works. Section III introduces the proposed geodesic invariant feature (GIF) in detail, and Section IV extends GIF to the superpixel-based mid-level representation. Section V presents a deep learning based method for feature mining. Experimental results and comparisons are provided in Section VI. Finally, we conclude this work in Section VII.

II. RELATED WORKS

Local descriptors for photometric images have been well studied and remarkable progress has been made. The scale invariant feature transform (SIFT) introduced by Lowe [2] is one of the most widely used descriptors. In the context of matching and recognition, its superior performance has been demonstrated in the benchmarking work conducted by Mikolajczyk and Schmid [1]. Ke and Sukthankar [3] presented the PCA-SIFT descriptor, which represents local appearance


by principal components of the normalized gradient field. Dalal and Triggs [9] used SIFT on dense image blocks and proposed the histogram of oriented gradients (HOG) descriptor for pedestrian detection. Bay et al. [4] proposed speeded up robust features (SURF), which are faster to compute and match while preserving the discriminative power of SIFT. Another important branch of photometric local descriptors is the local binary pattern (LBP) proposed by Ojala et al. [27], which has gained increasing attention due to its simplicity and excellent performance in various texture and face image analysis tasks. Ahonen et al. [19] exploited the LBP for face recognition. Zhang et al. [18] proposed the local Gabor binary pattern for face representation and recognition. Zhao and Pietikäinen [11] proposed the local binary pattern on three orthogonal planes and used it for dynamic texture recognition. Wang et al. [39] combined LBP with HOG and achieved superior performance for pedestrian detection. Lepetit and Fua [40] and Ozuysal et al. [41] showed that image patches can be effectively classified on the basis of a relatively small number of pairwise intensity comparisons. Based on this observation, Calonder et al. [7] computed a binary descriptor, referred to as BRIEF, on the basis of simple intensity difference tests. They showed that BRIEF can yield accuracy comparable to SIFT and SURF, but with much lower computation cost. Rublee et al. [42] presented the ORB descriptor, which incorporates orientation information into the FAST [43] keypoint detector and the BRIEF descriptor.

Compared with the well-developed photometric descriptors, local representations in depth images are relatively less investigated. Some researchers adapt the above descriptors to the depth map. Yang et al. [44] computed the HOG descriptor from depth motion maps for human action recognition. Zhang et al. [45] calculated the histogram of 3D facets (H3DF) to explicitly encode the 3D shape information from depth maps and used it for hand gesture recognition. Lo and Siebert [46] extended SIFT into the 2.5D domain by concatenating the histogram of range surface topology types and the histogram of range gradient orientations. Bayramoglu and Alatan [47] integrated SIFT with the shape index to match surfaces with different scales and orientations in range images. Huynh et al. [48] proposed the Gradient-LBP (G-LBP) to encode facial depth information for gender recognition.

In the context of human body parts recognition and tracking, two branches of methods have been widely used in the recent literature. One branch uses random dictionary patch filtering as the descriptor. Plagemann et al. [49] proposed a method for body part detection in ToF images which is based on identifying geodesic extrema on the surface mesh and utilized the patch filtering results as the descriptors; this descriptor is referred to as Accumulative Geodesic EXtrema (AGEX). Ganapathi et al. [50] extended this work to markerless human pose tracking. In [51], they used a Dynamic Bayesian Network (DBN) to model the motion states and extended iterative closest points (ICP) to enforce a free space constraint. Demirdjian et al. [52] recovered the human pose by combining local optimization with global retrieval. Following this paradigm, Baak et al. [53] proposed a data-driven hybrid strategy that combines generative and


discriminative methods. Jiu et al. [54] presented a deep learning based method that utilizes spatial information for feature extraction. Another widely used descriptor is also based on pairwise pixel comparisons, just like the BRIEF [7] descriptor. Shotton et al. [26] used simple depth pixel comparison features to generate parallelizable decision forests for body parts classification. Girshick et al. [55] extended this work by replacing the decision forests with regression forests, which can regress the positions of body joints from the depth image directly without the help of intermediate classification results. Shotton et al. summarized and compared these two methods in [23].

III. GEODESIC INVARIANT FEATURE (GIF)

The proposed method is inspired by the early works [7], [23], [26] which use binary strings generated by pairwise comparisons to represent a local patch. In [7], the comparisons are based on the intensity of pixels, and in [23] and [26] the comparisons are based on the depth values of pixels. The difference of the proposed method is that the geodesic gradient is used to rectify the pairwise comparison process, which endows the feature with robustness to articulate motion.

Given a depth image I, I(p) represents the pixel's depth value at position p = (x, y)^T. D_{C_{p_c,r}, F} denotes a local descriptor determined by two parameters: coverage C_{p_c,r} and feature list F. C_{p_c,r} represents the coverage region of a local descriptor within the image, where p_c is the center and r is the radius of the coverage region. F = {P_1, ..., P_n} denotes the list of random feature pairs, where P_i = (p_{u_i}, p_{v_i}) is a pair of random positions and n is the number of position pairs. The simple comparison function is defined as follows:

τ(p_u, p_v) = 1 if |I(p_u) − I(p_v)| > t, and 0 otherwise,   (1)

where (p_u, p_v) is a random pair from the list F and t is the comparison threshold. By applying the comparison function τ(·) to the feature list F, a binary string f ∈ {0, 1}^n is obtained and serves as the feature vector of the descriptor D_{C_{p_c,r}, F}, as shown in Fig. 2(a) and (b). In [23] and [26], f is used to train randomized decision forests and regression forests.

Scale Invariance. In photometric images, scale invariance is commonly obtained by a scale-space search for local maxima (SIFT/SURF) or interest point detection on a scale pyramid of the image (ORB). In depth images, scale normalization can be handled much more easily, since the object's image scale is inversely proportional to its distance to the camera, which is represented by the pixel value. In order to make a descriptor invariant to the distance variation, the coverage of the descriptor should be constant in real world space. Based on projection geometry, the radius r of feature coverage C_{p_c,r} on the image should be defined as:

r = α / I(p_c)   (2)


Fig. 3. An intuitive example of the GIF descriptor: the GIF is represented by the circles (green) on the right hand of the Vitruvian man; the feature of [23] and [26] is represented by the circles (blue) on the left hand.

Fig. 2. Using the geodesic gradient to rectify the feature extraction. (a) Comparison between two points generates one binary feature value. (b) A region is represented by the binary string produced by multiple random comparisons, as used in [23] and [26]. (c) Random points generated with respect to the canonical direction Γ. (d) A descriptor contains multiple pair comparisons. (e)–(f) The random point pairs are covariant with the canonical direction Γ.
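As a point of reference, the following is a minimal NumPy sketch of the pairwise comparison of Eq. (1), the distance-adaptive radius of Eq. (2), and the Γ-rectified polar sampling illustrated in Fig. 2. It is not the authors' implementation: the helper names, the threshold t, and the constant α are placeholders chosen for illustration, and the canonical direction gamma is assumed to be given (its computation is described in the following paragraphs).

```python
import numpy as np

def sample_pairs(n, rng):
    """Random point pairs in polar form (theta in [0, 2*pi), rho in [0, 1)),
    expressed relative to the descriptor centre and the canonical direction."""
    return rng.uniform(0.0, 2.0 * np.pi, (n, 2)), rng.uniform(0.0, 1.0, (n, 2))

def gif_binary_string(I, pc, gamma, thetas, rhos, alpha=5000.0, t=30.0):
    """Binary string of Eq. (1) for centre pixel pc, rectified by direction gamma.

    I     : 2-D depth image (e.g. in millimetres)
    pc    : (row, col) of the descriptor centre
    gamma : canonical direction in radians (geodesic gradient at pc)
    alpha : constant of Eq. (2); the pixel radius is r = alpha / I[pc]
    t     : comparison threshold of Eq. (1)
    """
    r = alpha / max(float(I[pc]), 1.0)        # Eq. (2): radius shrinks with distance
    h, w = I.shape
    bits = np.zeros(thetas.shape[0], dtype=np.uint8)
    for i, (th, rho) in enumerate(zip(thetas, rhos)):
        # Rotate both points of the pair by gamma so they stay covariant with it.
        ang, rad = th + gamma, rho * r
        rows = np.clip(pc[0] + rad * np.sin(ang), 0, h - 1).astype(int)
        cols = np.clip(pc[1] + rad * np.cos(ang), 0, w - 1).astype(int)
        du, dv = float(I[rows[0], cols[0]]), float(I[rows[1], cols[1]])
        bits[i] = 1 if abs(du - dv) > t else 0            # Eq. (1)
    return bits

# Toy usage on a synthetic depth image.
rng = np.random.default_rng(0)
thetas, rhos = sample_pairs(256, rng)
depth = rng.integers(500, 4000, (240, 320)).astype(np.float32)
f = gif_binary_string(depth, (120, 160), gamma=0.3, thetas=thetas, rhos=rhos)
```

Setting gamma to the same fixed value for every pixel reduces this sketch, roughly, to the BRIEFd-style comparison of [23] and [26] with scale normalization only.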

where I(p_c) is the depth value of the center pixel p_c, and α is a constant determined by the size of the coverage in real world space and the imaging focal length. Intuitively, this equation states that if the object is closer to the camera, the size of the descriptor on the image becomes larger, and vice versa.

Rotation Invariance. In photometric images, rotation invariance is obtained by rectifying the feature extraction with a canonical direction which is covariant with the local patch's rotation, for example, the direction that corresponds to the maximum histogram bin of gradients (SIFT/SURF) or the direction that points from the patch center to the intensity centroid (ORB). In the context of human body parts recognition in depth images, the geodesic gradient is chosen as the canonical direction, which is covariant with the body articulation. The canonical direction is referred to as Γ. Firstly, we introduce how to calculate the feature with the geodesic gradient; the properties of the geodesic gradient are presented in the following parts.

By assigning a consistent orientation to each descriptor based on local properties, the descriptor can be represented relative to this orientation and therefore achieves invariance to rotation [2]. For a geodesic invariance descriptor, its coverage is denoted by C^Γ_{p_c,r}, where Γ represents the canonical direction of the descriptor. The random point pairs are generated in polar coordinates where p_c is the origin and Γ is the polar axis, as shown in Fig. 2(c) and (d). Take random point p_u for instance: it is determined by two parameters, the angle θ_u ∈ [0, 2π) and the distance r_u ∈ [0, r). Since θ_u represents the relative angle between the point p_u and Γ, all the point pairs are covariant with the canonical direction Γ, as shown in Fig. 2(e) and (f).

Fig. 3 gives an intuitive example which shows the contribution of the canonical direction Γ. The feature which covers the right hand of the Vitruvian man is the GIF descriptor, and the one covering the left hand represents the feature of [23] and [26]. This example shows that the variation introduced by the articulation is canceled out by the canonical direction Γ; therefore the positions of the given point pair are relatively

stable with respect to the local body parts. But for the feature of [23] and [26], this invariance cannot be readily maintained.

Calculation of the canonical direction Γ is inspired by the insight that geodesic distances on a surface mesh are largely invariant to mesh deformations and rigid transformations [49]. More intuitively, the distance from the left hand of a person to the right hand along the body surface is relatively unaffected by his/her posture. For the given object, f_g represents the foreground object segmentation result, as shown in Fig. 1(c), and p_0 represents the centroid pixel of f_g, marked as the red cross in Fig. 1(d). Calculation of the canonical direction Γ contains the following steps:

1) Build an undirected graph G = (V, E) from f_g, where the vertex set V consists of all the pixels on f_g and the edge set consists of the eight-neighborhood relationships of f_g. The weight of each edge corresponds to the Euclidean distance between two neighboring points. The geodesic distance between two vertices is defined as the sum of weights along the shortest path, which can be found efficiently by Dijkstra's algorithm.

2) The geodesic distance map I_d is generated by calculating the geodesic distances between the pixels on f_g and p_0, as shown in the first column (from left) of Fig. 4.

3) The pixels with equal geodesic distances to p_0 are marked in the isoline map, as shown in the second column (from left) of Fig. 4.

4) For each pixel, the canonical direction can either be calculated as Γ = arctan(∂I_d/∂x, ∂I_d/∂y) or as the direction pointing along the shortest path obtained from step 1). In this paper, we adopt the second approach, and examples of the canonical direction Γ are shown in the fourth column (from left) of Fig. 4.

The nature of the canonical direction Γ is presented in the rightmost column of Fig. 4, in which hand patches of four different poses are shown. Rectified by Γ, the positions of the point pairs are stable with respect to the body parts in different poses; therefore, invariance to the articulation can be obtained.

It should be noted that the above scale and rotation normalization methods are based on a simplified approximation of complex articulate motion. They can reduce but cannot totally eliminate the variation caused by articulation. For example, if the body connection relationship changes, such as with closed loops and overlap/occlusion, the canonical direction will change as well. Significant out-of-plane rotation will affect both the


Fig. 5. Superpixel segmentation results. (a) Ground truth label of the depth data. (b) SLIC superpixel segmentation of the data.

Fig. 6. Feature calculation for the superpixels.
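The paper obtains superpixels with SLIC [28] on real-world [x, y, z] coordinates. As a rough, simplified stand-in for that step, the sketch below groups foreground points with plain k-means and then applies the comparison of Eq. (1) to superpixel mean depths; the clustering method, the threshold, and the data layout are assumptions made for illustration, not the authors' code.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def superpixel_means(xyz, n_superpixels=300, seed=0):
    """Group foreground points (N x 3 real-world coordinates, e.g. metres) into
    superpixels and return (per-point labels, mean depth z of each superpixel).
    SLIC on [x, y, z] is what the paper uses; k-means is only a simplified
    illustration of the same grouping idea."""
    km = MiniBatchKMeans(n_clusters=n_superpixels, random_state=seed, n_init=3)
    labels = km.fit_predict(xyz)
    mean_z = np.array([xyz[labels == k, 2].mean() if np.any(labels == k) else 0.0
                       for k in range(n_superpixels)])
    return labels, mean_z

def scgif_bit(mean_z, sp_u, sp_v, t=0.03):
    """Eq. (1) applied to superpixel mean depths instead of individual pixels."""
    return 1 if abs(mean_z[sp_u] - mean_z[sp_v]) > t else 0

# Toy usage: a random foreground point cloud grouped into a few hundred superpixels.
rng = np.random.default_rng(1)
points = rng.normal(size=(20000, 3)).astype(np.float32)
labels, mean_z = superpixel_means(points)
bit = scgif_bit(mean_z, 0, 1)
```

Whatever clustering is used, the key point is that the pairwise test now reads a few hundred cached mean depths rather than tens of thousands of raw pixels, which is where the speed and noise benefits discussed below come from.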

Fig. 4. Geodesic invariance feature. (a) ∼ (d) show the invariance of a GIF descriptor under four different poses. The geodesic distance map, isoline map, geodesic gradient map and feature extraction are shown from left to right.
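A minimal sketch of the geodesic distance map and shortest-path canonical direction shown in Fig. 4 is given below, assuming a binary foreground mask and a simple per-pixel edge weight. The weight definition (pixel step length plus depth difference) and the helper names are placeholders for illustration, not the authors' implementation, which weights edges by the Euclidean distance between neighboring points.

```python
import heapq
import numpy as np

def geodesic_map(depth, fg_mask, p0):
    """Dijkstra over the 8-neighbourhood graph of foreground pixels.
    Returns the geodesic distance map I_d and a predecessor map from which the
    canonical direction (shortest path back towards the centroid p0) follows."""
    h, w = depth.shape
    dist = np.full((h, w), np.inf)
    pred = np.full((h, w, 2), -1, dtype=int)
    dist[p0] = 0.0
    heap = [(0.0, p0)]
    nbrs = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[r, c]:
            continue
        for dr, dc in nbrs:
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and fg_mask[rr, cc]:
                # Placeholder edge weight: pixel step length plus depth difference,
                # a rough stand-in for the 3-D distance between neighbours.
                wgt = np.hypot(dr, dc) + abs(float(depth[rr, cc]) - float(depth[r, c]))
                if d + wgt < dist[rr, cc]:
                    dist[rr, cc] = d + wgt
                    pred[rr, cc] = (r, c)
                    heapq.heappush(heap, (d + wgt, (rr, cc)))
    return dist, pred

def canonical_direction(pred, p):
    """Direction (radians) pointing along the shortest path from p towards p0."""
    prev = pred[p[0], p[1]]
    return float(np.arctan2(prev[0] - p[0], prev[1] - p[1]))

# Toy usage on a small synthetic foreground blob.
depth = np.full((40, 40), 2000.0)
mask = np.zeros((40, 40), dtype=bool)
mask[5:35, 10:30] = True
dist_map, pred_map = geodesic_map(depth, mask, (20, 20))
gamma = canonical_direction(pred_map, (6, 11))
```

The isolines of Fig. 4 are simply level sets of dist_map, and following pred repeatedly yields the full shortest path rather than just its first step.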

scale and rotation normalization. These conditions are handled by introducing related samples during the training process.

IV. SUPERPIXEL CONSTRAINED GEODESIC INVARIANT FEATURE (ScGIF)

In this section, we develop the superpixel constrained geodesic invariant feature (ScGIF) and use it as the mid-level representation. The benefit of this mid-level representation is twofold: firstly, the processing speed can be improved by a big margin; secondly, better robustness to noise can be achieved. The motivation of ScGIF is based on the following observations:

1) The time complexity of Dijkstra's algorithm used in Section III for building the geodesic distance map is O(|E| + |V| log |V|), where |E| is the number of edges and |V| is the number of vertices in the graph. The processing speed is directly related to the number of pixels on the foreground object f_g. Therefore, if the number of pixels can be reduced, the processing speed can be improved.

2) The depth data obtained by the range sensor are noisy. The noise may come from the shadows of objects, strong ambient light that overwhelms the IR light, or object materials that scatter the IR light. Therefore, per-pixel feature extraction/classification is prone to be affected by the noise.

Based on the above observations, we replace the rigid structure of the pixel grid by perceptually meaningful atomic regions, superpixels. The superpixel method we used is based


on SLIC [28], and a segmentation example is illustrated in Fig. 5. The original SLIC method clusters pixels based on their [l, a, b, x, y] components, where l, a, and b are the color components in Lab space and x and y are the coordinates of the pixel. In our case, the clustering is performed based on [x, y, z, L], where x, y, and z are the coordinates in real world space and L is the label of the pixel. L is optional for the clustering and is only used for offline training and evaluation. Using L, we can make sure that the pixels within the same superpixel have consistent labels, as shown in Fig. 5(a) and (b). During online classification, only the real world coordinates [x, y, z] are used for superpixel segmentation.

For each superpixel, we record the mean depth value of all the pixels that belong to it. The random point pair comparison is replaced by the superpixel pair comparison. As illustrated in Fig. 6, the point pair (p_u, p_v) is mapped to the corresponding superpixels, and the comparison is carried out between their mean depth values. The canonical direction Γ points along the shortest path towards the object centroid p_0.

By SLIC clustering, the unit being processed is changed from pixels to superpixels. For a depth frame of VGA size, the foreground objects may contain tens of thousands of pixels, but this leads to only a few hundred superpixels. Therefore, the processing load is reduced dramatically. In addition, using the mean value to replace the individual pixel depths further improves the robustness to noise. These improvements are verified in the experimental section.

V. FEATURE MINING THROUGH DEEP NETWORK

Generally, high dimensional data have a complex distribution in their feature space. If the nonlinearity of the data is handled properly, the performance can be improved by a big margin [56]–[59]. There are a variety of approaches to exploit the nonlinearity of the data. For example, this task can be handled by the classifiers directly, such as support vector machines (SVM), which use kernel functions to build nonlinear


Fig. 8. Specifications of the benchmarking datasets.

Fig. 7. The structure of SdA and its layer outputs. As the feature propagates from the SdA-layer0 feature space to SdA-layer2, a trend of sparsity can be clearly observed.

classifiers to model the complex data distribution. It can also be handled by using an intermediate stage, between the original feature and the classifier, to map the feature from its original space to some latent feature space where the data may have a more compact distribution. Furthermore, a nonlinear classifier can still be used on top of the learned representation. The second approach is also called representation learning or feature learning, which is the essence of deep learning. The sequential training manner makes it especially suitable for learning tasks in big data environments.

In this section, we attempt to exploit the high order nonlinearity of the data by deep networks. Specifically, the employed deep network is based on stacked denoising autoencoders (SdA) [60]–[62]. Through the SdA, the data are projected nonlinearly from their original feature space to some latent representations. We refer to these representations as SdA-layerx feature spaces. The SdA can eliminate the irrelevant variation of the input data while preserving the discriminant information that can be used for classification and recognition. Meanwhile, the propagation of the data from the top layers to the deep layers of the SdA generates a series of latent representations with different abstraction abilities; the deeper the layer, the higher the level of abstraction.

The structure of our SdA deep network is illustrated in Fig. 7(a). It contains five layers: 1 input layer, 3 hidden/SdA layers, and 1 output layer. Each layer contains a set of nodes, and the nodes between adjacent layers are fully connected. The number of nodes in the input layer equals n, which is the number of random pairs. The binary strings of ScGIF are fed directly to the network as the input layer. The number of nodes in the output layer is d, which equals the number of labels.
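The SdA itself is trained with Theano [32] and its details are given in [60]–[62]; the NumPy sketch below only illustrates the forward structure just described (an input layer of n binary ScGIF features, three sigmoid hidden layers, and a softmax/logistic-regression output over d part labels), together with the Glorot-style uniform initialization of [64] referenced by Eq. (3) in Section VI-B and a simple way to measure how sparse each layer's output is. The hidden layer widths and all numeric values are placeholders, not the trained network.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    # Uniform init in [-4*sqrt(6/(n_in+n_out)), +4*sqrt(6/(n_in+n_out))],
    # the sigmoid-unit interval of Glorot and Bengio [64] (Eq. (3) below).
    bound = 4.0 * np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-bound, bound, (n_in, n_out))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Placeholder sizes: n random pairs in, three hidden layers, d body-part labels out.
n, hidden, d = 2000, [2000, 2000, 2000], 15
rng = np.random.default_rng(0)
sizes = [n] + hidden
weights = [glorot_uniform(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
w_out = glorot_uniform(hidden[-1], d, rng)

f = rng.integers(0, 2, (1, n)).astype(np.float64)          # one ScGIF binary string
h_act, sparsity = f, []
for W in weights:                                           # SdA layers 0..2
    h_act = sigmoid(h_act @ W)
    sparsity.append(float(np.mean(np.abs(h_act) < 1e-3)))   # near-zero fraction per layer
probs = softmax(h_act @ w_out)                              # output layer (d class scores)
```

With randomly initialized weights the measured sparsity is of course meaningless; the point is only to show how the layer outputs discussed in the next paragraphs can be inspected for near-zero entries.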


The rationale and training details of the SdA are beyond the scope of this paper; please refer to [60]–[62] for more details.

Since we have claimed that the representations learnt by the SdA can eliminate the irrelevant variation of the input data while preserving the information that is useful for the final classification task, it is important to investigate what actually has been learnt through this deep feature hierarchy. We plot the features of different layers of the SdA in Fig. 7(b)–(e), which provide some insight into SdA learning. Fig. 7(b) is the binary string of the ScGIF descriptor which is fed directly to the input layer. Fig. 7(c)–(e) are the features that have been learnt by SdA layers 0–2. A very interesting observation is that as the data propagate towards the deeper layers of the network, a trend of sparsity can be clearly observed. At the final layer of the SdA, the number of non-zero entries is reduced to 331, which accounts for only 16.6% of the 2000-dimensional feature space. Based on the assumption that the deep neural network mimics the hierarchical organization of the human visual cortex [54], this trend of sparsity can be explained by the abstraction behavior of the cortex. Please refer to [63] for more details.

VI. EXPERIMENTS

In this section, we present the details of the parameter setting, the datasets, and the comparison results with the state-of-the-art methods.

Dataset: Three datasets are used for the performance evaluation. The first dataset is a self-captured dataset which is referred to as GIF-14. The depth data are captured using a Kinect sensor. Four actors perform different poses and actions. There are 6930 depth frames of VGA size. For each human body, 15 body parts are labeled. In addition, two public datasets, SMMC-10 [50] (available at: http://ai.stanford.edu/~varung/cvpr10/) and EVAL [51] (available at: http://ai.stanford.edu/~varung/eccv12/), with different capturing devices and resolutions are also used. The specifications and sample images of these datasets are illustrated in Fig. 8. It should be noted that the two public datasets do not contain joint labels that are suitable for our classification setup. Therefore, we generate the ground truth


TABLE I. THE SPECIFICATION OF THE EVALUATION CRITERIA
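The body of TABLE I is not reproduced here. As a reference point, the sketch below computes the listed criteria from predicted and ground-truth part labels, assuming the standard definitions of the confusion matrix, overall accuracy, average per-class accuracy, and per-class one-vs-rest TPR/FPR.

```python
import numpy as np

def evaluation_criteria(y_true, y_pred, n_classes):
    """Confusion matrix plus OA, AA and per-class one-vs-rest TPR/FPR,
    assuming the standard definitions of these quantities."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)             # rows: true label, cols: predicted
    total = cm.sum()
    oa = np.trace(cm) / total                      # overall accuracy
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
    aa = per_class.mean()                          # average (per-class) accuracy
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp
    fp = cm.sum(axis=0) - tp
    tn = total - tp - fn - fp
    tpr = tp / np.maximum(tp + fn, 1)              # y-axis of the ROC curve
    fpr = fp / np.maximum(fp + tn, 1)              # x-axis of the ROC curve
    return cm, oa, aa, tpr, fpr

# Toy usage with random labels for 15 body parts.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 15, 10000)
y_pred = rng.integers(0, 15, 10000)
cm, oa, aa, tpr, fpr = evaluation_criteria(y_true, y_pred, 15)
```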

TABLE II. THE PARAMETER SETTING FOR THE DEEP NETWORK

Fig. 9. The overall accuracy under different parameter settings. (a) Overall accuracy versus feature dimensions. (b) Overall accuracy versus feature coverage size.

labels from the marker positions provided by the datasets, as shown in the rightmost column of Fig. 8. For each dataset, we randomly select 50% of the frames for training, 10% for validation, and the remaining 40% for testing. Parameter evaluations are mainly based on the GIF-14 dataset, and the comparisons with the baselines are performed on all three datasets.

Baselines: The first baseline descriptor is the accumulative geodesic extrema descriptor proposed by Plagemann et al. [49], referred to as AGEX. The second baseline descriptor is presented by Shotton et al. [23], [26], which is a variant of the BRIEF [7] descriptor for depth images; therefore, this descriptor is referred to as BRIEFd. The geodesic invariant feature presented in Section III is denoted as GIF, the superpixel constrained geodesic invariant feature in Section IV is denoted as ScGIF, and the feature obtained by deep learning in Section V is denoted as ScGIF+SdA.

Criteria: Both the confusion matrix and the ROC curve are used to evaluate the performance of the methods. The involved criteria, such as overall accuracy (OA), average accuracy (AA), true positive rate (TPR, y-axis of the ROC curve), false positive rate (FPR, x-axis of the ROC curve), and the elements of the confusion matrix, are specified in TABLE I.

Parameters: The structure of the deep network SdA is slightly more complex, and more parameters are involved. We specify the details of these parameters in TABLE II. All the experiments in this paper follow these settings unless otherwise specified. The selection of the other important

parameters is detailed in the following subsection. The deep learning code is implemented based on the Python library Theano [32].

A. Parameter Selection

Besides the deep learning part, the overall framework of the proposed descriptor is simple. There are two important parameters that might affect the performance critically. The first one is the number of random pairs n for each descriptor, and the second one is the coverage size r of the descriptor.

For the number of random pairs n, we test five settings: {128, 256, 512, 1024, 2048}. Each setting is evaluated 5 times with different random feature lists. The mean overall accuracies and standard deviations of the four methods are illustrated in Fig. 9(a). From these results, two observations can be obtained. 1) As the size of the feature list increases, all descriptors' performances increase monotonically and their standard deviations become smaller as well. But when the feature size reaches 512 and above, the performance becomes stable. 2) Starting from the BRIEFd descriptor, as we add more constraints to the descriptor, essential performance improvements are observed, especially after the superpixel constraint and SdA feature mining.

The evaluation of the coverage size is illustrated in Fig. 9(b). The coverage size is given in real world coordinates using cm as the unit. The error bars are obtained by


using different random feature lists as well. Intuitively, a larger coverage size can include more context information, which is very helpful for differentiating body parts such as the left and right hands. Maximum accuracy is obtained at 65 cm; after that, the performance starts to drop slightly. One possible explanation is that if the coverage region is too big, the classifier may overfit to the context and the contribution of the central pixel becomes less important. A similar observation has been reported in [26]. For the following experiments, we use n = 2000 and r = 65 unless otherwise specified.

B. Sparseness Analysis of SdA

In Section V, we presented the sparseness of the SdA. In this subsection, we further verify this experimentally based on the statistics of the SdA networks. According to the research of Glorot and Bengio [64], the weights of the SdA are initialized uniformly within the interval

[ −4·sqrt(6 / (n_in + n_out)), 4·sqrt(6 / (n_in + n_out)) ]   (3)

where n_in is the number of units in the (i − 1)-th layer, and n_out is the number of units in the i-th layer. This initialization ensures that, early in training, each neuron operates in a regime of its activation function where information can easily be propagated both upward (activations flowing from inputs to outputs) and backward (gradients flowing from outputs to inputs). The distribution of the initialized weights is shown in Fig. 10(a). After training, we select a node randomly from each layer of the SdA and plot the distribution of its input weights, as illustrated in Fig. 10(b)–(d). From these results, two observations can be obtained. 1) As we have assumed, for each node, most of its input weights are equal or almost equal to zero. For all of the layers, more than 50% of the weights fall into the range of [−0.15, 0.15], which is the initialization interval according to Eq. (3). 2) As the layers of the SdA go deeper, the ratio of zero weights increases from 55.4% to 78.1%. This indicates that deeper layers have better sparseness. These two observations further verify the sparseness observation shown in Fig. 7.

Fig. 10. The weight distributions of different layers of the SdA. (a) According to the research of [64], all the weights are initialized randomly within the range of [−0.15, 0.15]. (b)–(d) The weight distributions of SdA layers 0–2.

C. Contribution of Canonical Direction Rectification

To highlight the performance improvement obtained by using the canonical direction, we compare BRIEFd and GIF in more detail. Both of these methods are pixel-wise descriptors, and the only difference between them is whether canonical direction rectification is used. Random forests are learnt as the classifiers, which contain 3 random trees, and the maximum level of each tree is 20. The confusion matrices are illustrated in Fig. 11, in which (a) is the confusion matrix of BRIEFd and (b) is the result of GIF. The average accuracy is increased from 77.7% to 82.9% and the overall accuracy is increased from 80.0% to 84.8%. From these results, two observations can be obtained. 1) With the help of the canonical direction, about 5% improvement can be achieved for both overall accuracy and average accuracy. 2) The improvements for the parts on the limbs are higher than for the parts on the torso. A possible explanation is that the articulate variations are more prominent on the limbs, and the canonical direction can counteract these variations effectively. The classification examples are presented in Fig. 12. Visually, the classification results of GIF are better than those of BRIEFd.

D. Comparison With the State-of-the-Art Methods

Firstly, we evaluate the performance improvement contributed by the superpixels and the SdA feature mining. The corresponding confusion matrices are presented in Fig. 11(c) and (d), and the visualization results are illustrated in Fig. 13. For the superpixel evaluation, the feature vector of ScGIF is still a binary string, and we use a random forest as the classifier. Since the number of superpixels is much smaller than the number of pixels, the maximum depth of the trees in the forest is reduced to 15; the other parameters are kept the same as for GIF in the previous section. Comparing the results of GIF presented in Fig. 11(b) and the results of ScGIF in Fig. 11(c), more than 5% improvement for both average accuracy and overall accuracy has been achieved. The overall accuracy is increased from 84.8% to 90.4% and the average accuracy is increased from 82.9% to 88.4%. In addition, we can find that the blocks with higher scores are more compactly distributed around the diagonal of the matrix. This means the misclassification occurs mainly between adjacent body parts, such as the hands and elbows, or the feet and the knees.

The descriptor that combines ScGIF and SdA feature mining is denoted as ScGIF+SdA. As illustrated in Fig. 7(e), the


Fig. 11. Confusion matrix of the classification results: (a) Confusion matrix obtained using the BRIEFd feature. (b) Confusion matrix of the GIF feature. (c) Confusion matrix obtained using ScGIF. (d) Confusion matrix obtained using ScGIF + SdA.

Fig. 12. Comparison of per-pixel classification results: (a) Per-pixel classification results using the feature BRIEFd. (b) Per-pixel classification results using GIF.

learnt feature of the SdA is no longer a binary string; it becomes a sparse float-valued feature vector instead. The random forest is not readily applicable to this feature vector. Therefore, we use a simple logistic regression model as the classifier, which is an integrated part of the SdA model and serves as the supervised learning phase of the SdA. Comparing the results of

Fig. 11(c) and (d), we can see that another 2% improvement is obtained through feature mining.

The visual comparison of these two methods is presented in Fig. 13. The first row (a) shows the ground truth labels of the data. Since the label information is obtained by hand labeling and nearest clustering, there are some mislabeled regions in the ground truth. For example, the right hand region of the second image (marked as yellow) is not perfectly labeled. The second row (b) contains the superpixel segmentation results of the ground truth data. The third and fourth rows are the superpixel classification results of ScGIF and ScGIF+SdA. Compared with the results in Fig. 12, both of these superpixel based methods yield visually more satisfactory results. This verifies one of our assumptions about superpixels, namely that superpixels can suppress the noise effect of individual pixels and yield a more robust representation.

The comparison results (ROC curves) with the baselines on the three datasets are presented in Fig. 14. Regarding the AGEX method [49], since we are working on the classification


Fig. 13. The superpixel classification results.



task, we use their patch based descriptor without the geodesic EXtrema detector. The random forest is used as the classifier, and the training process is very similar to that of BRIEFd and GIF; the only difference is that we replace the random pair comparison with random dictionary patch filtering. Following the setting in [65], 2000 random sub-patches for each joint are generated and used as the dictionary. From this result, we have the following observations:

1) Among all three datasets, the relative ranks of the tested methods are stable. The proposed GIF descriptor and its two variants can outperform the other baseline methods. The performance of the dictionary patch matching based descriptor (AGEX) is lower than the others. One possible explanation is that the dictionary patches for the depth data lack texture and cannot provide enough discriminative information [65].

2) Comparing the superpixel based methods (ScGIF, ScGIF+SdA) with the pixel based methods (BRIEFd and GIF), prominent improvements have been observed on all three datasets. This indicates that the superpixel is powerful for representing the depth data.

3) The SMMC-10 [50] dataset was captured by a ToF camera. Compared with the other two datasets that were captured by Kinect, the image resolution is relatively low and the noise level is high, as shown in Fig. 8. This might be the main reason for the performance decrease of all the tested methods. But the proposed methods (ScGIF, ScGIF+SdA) still yield satisfactory results.

In addition, we analyze the efficiency of the proposed methods. The accuracy versus run time figure is illustrated

Fig. 14. Comparisons with the state-of-the-art methods: (a) Comparison results on the GIF-14 dataset; (b) Comparison results on the SMMC-10 [50] dataset; (c) Comparison results on the EVAL [51] dataset.

in Fig. 15. Here, only the classification time is considered and the foreground segmentation time is not taken into account. The testing platform is an Intel i7 3.7 GHz processor with 32 GB RAM. BRIEFd is the fastest one since only simple pixel comparisons are involved in the evaluation. The superpixel is critical for the speed improvement: it increases the speed of GIF from 3.7 fps to 30 fps (ScGIF). This verifies another assumption about the superpixel: the superpixel based representation can reduce the image redundancy and improve the processing speed dramatically. The accuracy winner ScGIF+SdA runs at


Fig. 15. Run time comparisons.


15 fps on a GTX 780i GPU using the Python library Theano [32] (only the SdA part runs on the GPU; the superpixel part is still on the CPU). Since the evaluation process of the deep network has high parallelism, the GPU time (including copying the data from host to device) for each frame is only 28.7 ms.

VII. CONCLUSION

In this work, we presented a geodesic invariant feature and its two variants to encode the local structure of depth data. These new descriptors were applied in the context of human body parts recognition. Specifically, the proposed descriptors form a multi-level feature extraction hierarchy: the pixel based low-level representation addressed the articulation variation of human motion by introducing the canonical direction to rectify the feature extraction process; the superpixel based mid-level representation replaced the rigid structure of the pixel grid by perceptually meaningful atomic regions, which reduced the computation cost and improved the robustness to noise; the high-level feature exploited the nonlinearity of the data by deep networks, which further improved the performance of the descriptor. We compared the proposed method with the state-of-the-art methods, and encouraging results have been observed: the proposed method can achieve superior classification accuracy and visual quality. The experiments also reveal the following conclusions: first, the canonical direction improves the invariance to articulation, especially for the body parts that are far from the torso, such as the hands and feet; second, the superpixel is an efficient representation of the depth data, which can reduce the effect of noise and improve the processing speed; third, deep learning can find the nonlinear dependencies of the data, and by projecting the feature from its original space to some latent sparse feature space, better representation ability can be obtained.

REFERENCES

[1] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1615–1630, Oct. 2005. [2] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004. [3] Y. Ke and R. Sukthankar, “PCA-SIFT: A more distinctive representation for local image descriptors,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun./Jul. 2004, pp. II-506–II-513.

[4] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” Comput. Vis. Image Understand., vol. 110, no. 3, pp. 346–359, 2008. [5] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “BRIEF: Binary robust independent elementary features,” in Proc. Eur. Conf. Comput. Vis., 2010, pp. 778–792. [6] J. Chen et al., “WLD: A robust local image descriptor,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1705–1720, Sep. 2010. [7] M. Calonder, V. Lepetit, M. Ozuysal, T. Trzcinski, C. Strecha, and P. Fua, “BRIEF: Computing a local binary descriptor very fast,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1281–1298, Jul. 2012. [8] E. Valle, “Local descriptor matching for image identification systems,” Ph.D. dissertation, Dept. Doctorate School Sci. Eng., Univ. CergyPontoise, Cergy-Pontoise, France, 2008. [9] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2005, pp. 886–893. [10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010. [11] G. Zhao and M. Pietikainen, “Dynamic texture recognition using local binary patterns with an application to facial expressions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 915–928, Jun. 2007. [12] J. Chen, G. Zhao, M. Salo, E. Rahtu, and M. Pietikäinen, “Automatic dynamic texture segmentation using local descriptors and optical flow,” IEEE Trans. Image Process., vol. 22, no. 1, pp. 326–339, Jan. 2013. [13] X. Hong, G. Zhao, M. Pietikäinen, and X. Chen, “Combining LBP difference and feature correlation for texture description,” IEEE Trans. Image Process., vol. 23, no. 6, pp. 2557–2568, Jun. 2014. [14] R. Rahmani, S. A. Goldman, H. Zhang, S. R. Cholleti, and J. E. Fritts, “Localized content-based image retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1902–1912, Nov. 2008. [15] X. Shen, Z. Lin, J. Brandt, and Y. Wu, “Detecting and aligning faces by image retrieval,” in Proc. IEEE Comput. Vis. Pattern Recognit., Jun. 2013, pp. 3460–3467. [16] D.-N. Ta, W.-C. Chen, N. Gelfand, and K. Pulli, “SURFTrac: Efficient tracking and continuous object recognition using local feature descriptors,” in Proc. IEEE Comput. Vis. Pattern Recognit., Jun. 2009, pp. 2937–2944. [17] M. Subrahmanyam, R. P. Maheshwari, and R. Balasubramanian, “Local maximum edge binary patterns: A new descriptor for image retrieval and object tracking,” Signal Process., vol. 92, no. 6, pp. 1467–1479, 2012. [18] W. Zhang, S. Shan, W. Gao, X. Chen, and H. Zhang, “Local Gabor binary pattern histogram sequence (LGBPHS): A novel non-statistical model for face representation and recognition,” in Proc. 10th IEEE Int. Conf. Comput. Vis., Oct. 2005, pp. 786–791. [19] T. Ahonen, A. Hadid, and M. Pietikainen, “Face description with local binary patterns: Application to face recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 12, pp. 2037–2041, Dec. 2006. [20] S. Izadi et al., “KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera,” in Proc. 24th ACM Symp. User Inter. Softw. Technol., 2011, pp. 559–568. [21] R. A. Newcombe et al., “KinectFusion: Real-time dense surface mapping and tracking,” in Proc. 10th IEEE Int. Symp. Mixed Augmented Real., Oct. 2011, pp. 127–136. [22] M. Ye and R. 
Yang, “Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera,” in Proc. IEEE Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2353–2360. [23] J. Shotton et al., “Efficient human pose estimation from single depth images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2821–2840, Dec. 2013. [24] J. Lallemand, O. Pauly, L. Schwarz, D. Tan, and S. Ilic, “Multi-task forest for human pose estimation in depth images,” in Proc. Int. Conf. 3D Vis., Jun./Jul. 2013, pp. 271–278. [25] T. Helten, A. Baak, G. Bharaj, M. Müller, H.-P. Seidel, and C. Theobalt, “Personalization and evaluation of a real-time depth-based full body tracker,” in Proc. Int. Conf. 3D Vis., 2013, pp. 279–286. [26] J. Shotton et al., “Real-time human pose recognition in parts from single depth images,” in Proc. IEEE Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1297–1304. [27] T. Ojala, M. Pietikäinen, and T. Mäenpää, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971–987, Jul. 2002.


[28] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012. [29] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013. [30] I. Arel, D. C. Rose, and T. P. Karnowski, “Deep machine learning— A new frontier in artificial intelligence research [research frontier],” IEEE Comput. Intell. Mag., vol. 5, no. 4, pp. 13–18, Nov. 2010. [31] Y. Bengio, “Learning deep architectures for AI,” Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, Jan. 2009. [32] J. Bergstra et al., “Theano: A CPU and GPU math compiler in Python,” in Proc. Python Sci. Comput. Conf., 2010, pp. 1–7. [33] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates, Inc., 2012. [34] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915–1929, Aug. 2013. [35] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. L. LeCun, “Learning convolutional feature hierarchies for visual recognition,” in Advances in Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates, Inc., 2010. [36] A. Bordes, X. Glorot, J. Weston, and Y. Bengio, “Joint learning of words and meaning representations for open-text semantic parsing,” in Proc. Int. Conf. Artif. Intell. Statist., 2012, pp. 127–135. [37] R. Socher, E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning, “Dynamic pooling and unfolding recursive autoencoders for paraphrase detection,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2011. [38] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning, “Semi-supervised recursive autoencoders for predicting sentiment distributions,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2011, pp. 151–161. [39] X. Wang, T. X. Han, and S. Yan, “An HOG-LBP human detector with partial occlusion handling,” in Proc. Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 32–39. [40] V. Lepetit and P. Fua, “Keypoint recognition using randomized trees,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 9, pp. 1465–1479, Sep. 2006. [41] M. Ozuysal, M. Calonder, V. Lepetit, and P. Fua, “Fast keypoint recognition using random ferns,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 3, pp. 448–461, Mar. 2010. [42] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 2564–2571. [43] E. Rosten and T. Drummond, “Machine learning for high-speed corner detection,” in Proc. 9th Eur. Conf. Comput. Vis., 2006, pp. 430–443. [44] X. Yang, C. Zhang, and Y. Tian, “Recognizing actions using depth motion maps-based histograms of oriented gradients,” in Proc. 20th ACM Int. Conf. Multimedia, 2012, pp. 1057–1060. [45] C. Zhang, X. Yang, and Y. Tian, “Histogram of 3D Facets: A characteristic descriptor for hand gesture recognition,” in Proc. 10th IEEE Int. Conf. Autom. Face Gesture Recognit., Apr. 2013, pp. 1–8. [46] T.-W. R. Lo and J. P. Siebert, “Local feature extraction and matching on range images: 2.5 D SIFT,” Comput. Vis. 
Image Understand., vol. 113, no. 12, pp. 1235–1250, 2009. [47] N. Bayramoglu and A. A. Alatan, “Shape index SIFT: Range image recognition using local features,” in Proc. 20th Int. Conf. Pattern Recognit., Aug. 2010, pp. 352–355. [48] T. Huynh, R. Min, and J.-L. Dugelay, “An efficient LBP-based descriptor for facial depth images applied to gender recognition using RGB-D face data,” in Proc. ACCV Workshop, 2012, pp. 133–145. [49] C. Plagemann, V. Ganapathi, D. Koller, and S. Thrun, “Real-time identification and localization of body parts from depth images,” in Proc. Int. Conf. Robot. Autom., May 2010, pp. 3108–3113. [50] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun, “Real time motion capture using a single time-of-flight camera,” in Proc. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 755–762. [51] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun, “Real-time human pose tracking from range data,” in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 738–751. [52] D. Demirdjian, L. Taycher, G. Shakhnarovich, K. Grauman, and T. Darrell, “Avoiding the ‘streetlight effect’: Tracking by exploring likelihood modes,” in Proc. Int. Conf. Comput. Vis., 2005, pp. 357–364.


[53] A. Baak, M. Müller, G. Bharaj, H.-P. Seidel, and C. Theobalt, “A datadriven approach for real-time full body pose reconstruction from a depth camera,” in Proc. Int. Conf. Comput. Vis., 2011, pp. 1092–1099. [54] M. Jiu, C. Wolf, G. Taylor, and A. Baskurt, “Human body part estimation from depth images via spatially-constrained deep learning,” Pattern Recognit. Lett., vol. 50, pp. 122–129, Dec. 2014. [55] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon, “Efficient regression of general-activity human poses from depth images,” in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 415–422. [56] H. Hoffmann, “Kernel PCA for novelty detection,” Pattern Recognit., vol. 40, no. 3, pp. 863–874, 2007. [57] Y. Li, S. Gong, and H. Liddell, “Recognising trajectories of facial identities using kernel discriminant analysis,” Image Vis. Comput., vol. 21, nos. 13–14, pp. 1077–1086, 2003. [58] S. Mika, G. Ratsch, J. Weston, B. Scholkop, and K.-R. Muller, “Fisher discriminant analysis with kernels,” in Proc. IEEE Neural Netw. Signal Process., Aug. 1999, pp. 41–48. [59] B. Scholkopf, A. Smola, and K.-R. Muller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Comput., vol. 10, no. 5, pp. 1299–1319, 2006. [60] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layerwise training of deep networks,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2007. [61] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 1096–1103. [62] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” J. Mach. Learn. Res., vol. 11, pp. 3371–3408, Mar. 2010. [63] M. A. Goodale and A. D. Milner, “Separate visual pathways for perception and action,” Trends Neurosci., vol. 15, no. 1, pp. 20–25, 1992. [64] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010, pp. 249–256. [65] A. Torralba, K. P. Murphy, and W. T. Freeman, “Sharing features: Efficient boosting procedures for multiclass object detection,” in Proc. IEEE Comput. Vis. Pattern Recognit., Jun./Jul. 2004, pp. II-762–II-769.

Yazhou Liu received the B.S. degree in mechanical engineering from Harbin Engineering University, Harbin, China, in 2002, and the M.E. and Ph.D. degrees in computer science from the Harbin Institute of Technology, Harbin, in 2004 and 2009, respectively. Since 2011, he has been a faculty member with the Department of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. From 2009 to 2011, he was a Post-Doctoral Research Fellow with the Machine Vision Group, Oulu University, Oulu, Finland, and also an Engineer with the Panasonic Research and Development Center Singapore, Singapore, from 2007 to 2009.

Pongsak Lasang (M’10) received the B.E. (Hons.) degree in electronics and telecommunication engineering and the M.E. degree in electrical engineering from the King Mongkut’s University of Technology Thonburi, Bangkok, Thailand, in 2005 and 2006, respectively. In 2006, he joined the Panasonic Research and Development Center Singapore, Singapore, as a Research and Development Engineer. He has been working on camera processing and 3D related algorithm design. He received the 1st Place Best Paper Award for an ICCE 2010 paper (with co-authors). His research interests include multiview image/video processing, depth map estimation and 3D rendering, high-dynamic range imaging, digital camera image processing pipeline, and computational photography. He is a member of the Association for Computing Machinery.


Mel Siegel (F’83) received the Ph.D. degree in physics from the University of Colorado, Boulder, CO, USA. He is currently a faculty member in Robotics and an affiliated faculty member in Human Computer Interaction with Carnegie Mellon University, Pittsburgh, PA, USA. His research interests are in sensing, sensors, perception, and display systems in robotics contexts. He has done extensive research in sensors for robot proprioception and environmental awareness, robots for sensing missions, 3D-stereoscopic display systems, and scaling, power, and energy issues in robotics. In addition to his research and teaching, he also directs the Master of Science in Robotics Technology program. He has been an Active Member and an Officer of the Instrumentation and Measurement Society.


Quansen Sun received the Ph.D. degree in pattern recognition and intelligence systems from the Nanjing University of Science and Technology, Nanjing, China, in 2006, where he is currently a Professor with the Department of Computer Science. He visited the Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong, in 2004 and 2005. His current research interests include pattern recognition, image processing, remote sensing information systems, and medical image analysis.
