This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2015.2481324, IEEE Transactions on Image Processing 1

Layered Compression for High Precision Depth Data Dan Miao, Jingjing Fu, Yan Lu, Shipeng Li, Fellow, IEEE, and Chang Wen Chen, Fellow, IEEE

Abstract—With the development of depth data acquisition technologies, access to high precision depth with more than 8 bits per pixel has become much easier, and determining how to efficiently represent and compress this high precision depth is essential for practical depth storage and transmission systems. In this paper, we propose a layered high precision depth compression framework based on an 8-bit image/video encoder to achieve efficient compression with low complexity. Within this framework, considering the characteristics of high precision depth, a depth map is partitioned into two layers: the most significant bits (MSBs) layer and the least significant bits (LSBs) layer. The MSBs layer provides the rough depth value distribution, while the LSBs layer records the details of the depth value variation. For the MSBs layer, an error-controllable pixel domain encoding scheme is proposed to exploit the data correlation of the general depth information with sharp edges and to guarantee that the LSBs layer remains in an 8-bit format after absorbing the quantization error from the MSBs layer. For the LSBs layer, a standard 8-bit image/video codec is leveraged to perform the compression. The experimental results demonstrate that the proposed coding scheme achieves real-time depth compression with satisfactory reconstruction quality. Moreover, thanks to the error control algorithm, the compressed depth data generated by this scheme achieves better performance in view synthesis and gesture recognition applications than that of conventional coding schemes.

Index Terms—Depth compression, layered compression, high precision depth.

I. INTRODUCTION

As a typical representation of physical information, depth data has attracted increasing research interest with the fast development of depth data acquisition technologies and its potential impact on 3D related applications. So far, various depth sensors have been brought forth to record spatial information in the physical world in real-time, such as time-of-flight (TOF) cameras [1], stereo cameras [2] and structured light cameras [3]. With the evolution of depth sensors, the generated depth maps continue to improve in depth precision. For instance, the depth value range of the new depth sensor "Kinect" is up to 4000, and its pixel value is represented by a 12-bit integer in the Windows SDK [3]. TOF camera data also has a depth range of 0 to 2000. As a result, high precision depth with more than 8 bits per pixel can easily be acquired from depth sensors for the benefit of depth assisted applications.

D. Miao was with Microsoft Research Asia, Beijing, China. He is now with the University of Science and Technology of China, Hefei, Anhui, 230026, China (e-mail: [email protected]). J. Fu, Y. Lu and S. Li are with Microsoft Research Asia, Beijing, 100190, China (e-mail: jifu,yanlu,[email protected]). C. W. Chen is with the Department of Computer Science and Engineering, University at Buffalo, State University of New York, Buffalo, NY 14260-2500, USA (e-mail: [email protected]).

The data size of high precision depth data is larger than that of normal depth data. Therefore, an efficient compression scheme is essential for high precision depth storage and transmission. In some scenarios (e.g., immersive telepresence, remote depth acquisition), depth data has to be transmitted through a network in real-time, which imposes even higher requirements on compression complexity and efficiency. Depth compression has been studied for years with respect to efficient depth representation and transmission. One straightforward approach is to encode the depth map/sequence using conventional image/video compression schemes [4]. Considering that depth data contains smooth areas partitioned by sharp edges, with very limited texture, some depth compression approaches [5–9] have been proposed in an attempt to achieve more efficient depth compression based on existing image/video coding frameworks, such as platelet based coding [5, 6] or edge based coding [7–9]. In platelet based coding schemes, the depth frame is segmented into regions of different sizes, and each region is encoded by utilizing the homogeneity of the depth map. For example, Morvan et al. [5] segment the depth using a quadtree decomposition, in which each block is modeled using one of three pre-defined piecewise linear functions. Milani and Calvagno [6] partition the depth map based on graph-based image segmentation. In an attempt to reduce the bit cost caused by non-zero high frequency transform coefficients in edge blocks, Shen et al. [7] propose an edge-adaptive transform, which avoids filtering across edges by encoding edge positions explicitly. In [8, 9], shape-adaptive wavelets are employed to ensure that the support of the wavelet lies in the same region separated by edges. Unfortunately, most of these coding schemes are designed for normal depth data with 8 bit depths and are usually implemented based on 8-bit conventional image/video codecs.
The data property of high precision depth data is not substantially considered in these schemes. With the development of depth sensors, approaches [10–12] for high precision depth compression have been proposed. Mehrotra et al. [10] proposed a near lossless depth compression scheme that encodes the scaled reciprocals of the Kinect depth values pixel by pixel. Although a significant data size reduction is achieved, the strong relationships among neighboring frames are barely considered. In [12], Fu et al. proposed a Kinect-like depth compression scheme to recover the data correlation destroyed by noise in the temporal and spatial domains. To fully exploit the data correlation, this scheme is implemented based on an extended version of a standard video codec. As is known, with the high profiles defined in the fidelity range extensions, a standard video codec can offer an

1057-7149 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


extension of bit-depth up to more than 8 bits, such as 14 bits in H.264/AVC and 12 bits in HEVC. However, optimized encoding implementations at more than 8-bit depths are rare: most optimized implementations of standard codecs, both software (e.g., x264, FFmpeg) and hardware, are based on the 8-bit data format. This implementation limitation leads to high computation complexity for high precision depth compression. To fully utilize the optimized 8-bit codecs, layered compression is an efficient approach for high precision depth compression. In traditional high dynamic range (HDR) image/video compression, tone mapping based schemes [13, 14] are widely utilized to convert the high bit-depth image/video into an 8-bit base layer image/video and an enhancement layer. However, the data format of high precision depth data is quite different from that of the high dynamic range texture image, which is usually represented by floating-point radiance information [15–17]. Moreover, the physical meaning of the depth map also differs from that of the texture image. Thus, the HDR coding schemes designed to minimize the change in visual appearance during compression cannot be employed for high precision depth coding. In multiview image/video coding, layer-based representation is also an efficient approach [18] to exploit the correlation among multiple views, in which the multiview data is partitioned into homogeneous layers and each layer is a collection of Epipolar plane image (EPI) lines modeled by a constant depth plane. However, this layered compression scheme is designed for multiple texture images and cannot be employed for depth data compression. In our previous work [19], we proposed a layered compression scheme leveraging the standard 8-bit codec which achieves high coding efficiency on Kinect depth.

In this paper, we further explore the layered coding framework on general high precision depth data captured by distinct depth sensors and optimize the coding performance based on a rate-distortion model. Based on the characteristics of high precision depth data, we propose a layered compression framework for high precision depth to achieve both high efficiency and low complexity in compression. In this framework, the depth data is first partitioned into two layers at the bit-depth level: the most significant bits (MSBs) layer, which contains the multiple highest bit depths with the general depth information of the objects, and the least significant bits (LSBs) layer, which

Fig. 1. The first frame of each depth sequence used in this paper, top: Cubicle, Movcam, Player one; bottom: Talking, Dude, Laurana.

TABLE I
DESCRIPTION OF THE TEST SEQUENCES

Sequence      Depth sensor    Description
Cubicle       Kinect          Fixed camera captures static cubicle
MovCam        Kinect          Moving camera scans the cubicle
Player one    Kinect          One guy playing Kinect game
Talking       Kinect          Two guys talking on the sofa
Dude          Stereo camera   One robot is walking
Laurana       Stereo camera   One statue is rotating
contains the multiple lowest bit depths with tiny variations in depth value on surfaces. In terms of the data properties, distinct coding schemes are designed for each layer. For the MSBs layer, a pixel domain coding scheme is designed to fully exploit the spatial correlation among multiple surfaces with sharp edges. For the LSBs layer, a transform-domain coding scheme is employed. An error control algorithm is proposed to guarantee that the data format of the LSBs layer is 8-bit, so that optimized standard image/video codecs can be integrated directly. The experimental results show that the proposed scheme can achieve real-time depth compression with high coding efficiency. Moreover, we investigate the impact of depth compression on the performance of view synthesis and gesture recognition and observe that better performance can be achieved using the depth map reconstructed by the proposed scheme in comparison with the conventional scheme. The main contributions of this framework can be summarized as follows: 1) We propose a layered representation of high precision depth by simple data partition at the bit-depth level to support a multi-grained description of depth. 2) The layered coding scheme can leverage 8-bit standard codecs; most 8-bit image/video encoders, both software and hardware, can easily be integrated to accelerate the coding process. 3) The proposed coding scheme can achieve high coding efficiency and low coding complexity while preserving the depth features desired by relevant applications, and can be integrated into a real-time remote depth acquisition system. The rest of the paper is organized as follows. In Section II, we provide a data analysis of high precision depth data and then propose the layered compression framework. Section III describes the compression scheme of each layer in detail. The optimization of the bit-depth level partition is studied in Section IV. The experimental results are presented in Section V.
Section VI concludes this paper.

II. MOTIVATION AND THE PROPOSED FRAMEWORK

In this section, we first provide a data analysis of high precision depth, which motivates the proposed framework, and then introduce the layered depth coding framework.

A. Motivation based on data analysis

For high precision depth data, even though the pixel value range is huge for the overall depth map, the local range in most regions is small. To investigate this data characteristic, we measure the pixel range for each 16x16 block in a depth map by calculating the difference between the maximum and minimum valid depth values. Six depth



Fig. 3. The architecture of the proposed high precision depth encoding and decoding. (Block diagram: the depth (n bits) passes through Padding and Depth Partition; the MSBs Layer Encoder (m-bit) performs Major Depth Extraction, Depth Quantization and Pixel Domain Encoding on MSBsini (m bits) to produce the MSBs layer bitstream, while QE Scaling feeds the quantization error into LSBsini (n-m bits), which the 8-bit LSBs Layer Encoder compresses into the LSBs layer bitstream; after transmission, the MSBs Layer Decoder (m-bit) and the 8-bit LSBs Layer Decoder feed Depth Merging and Mask Overlapping to produce the reconstructed depth (n bits).)

TABLE II
THE BLOCK PROPORTIONS OF PIXEL VALUE RANGE (%)

Sequence      [0, 255]   [256, 1023]   [1024, 4095]
Cubicle         80.7         18.1           1.2
Movcam          85.1         11.9           3.0
Player one      78.5         14.3           7.2
Talking         78.7         13.8           7.5
Dude            88.2          9.6           2.2
Laurana         93.6          6.4           0
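The per-block range statistic behind Table II can be sketched as follows (a minimal sketch with hypothetical function and bucket names; the real measurement additionally handles invalid-pixel masks per sequence):

```python
def block_range_histogram(depth, w, h, bs=16):
    """Sketch of the Table II statistic: for each bs x bs block, compute the
    max-min range of valid (non-zero) pixels and bucket it into the three
    ranges used in the table."""
    buckets = {"[0,255]": 0, "[256,1023]": 0, "[1024,4095]": 0}
    for by in range(0, h, bs):
        for bx in range(0, w, bs):
            vals = [depth[y * w + x]
                    for y in range(by, min(by + bs, h))
                    for x in range(bx, min(bx + bs, w))
                    if depth[y * w + x] > 0]      # skip invalid (zero) pixels
            if not vals:
                continue                          # block with no valid depth
            r = max(vals) - min(vals)
            if r <= 255:
                buckets["[0,255]"] += 1
            elif r <= 1023:
                buckets["[256,1023]"] += 1
            else:
                buckets["[1024,4095]"] += 1
    return buckets
```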

sequences from two data sets shown in Fig. 1 are used as examples. The first four depth sequences are captured at 30 fps with a resolution of 640x480 by Kinect. The last two depth sequences are captured at 30 fps with a resolution of 800x480 by a stereo depth camera. The depth value ranges of these six sequences are all [0, 4095] with 12 bit depths. The descriptions of the six sequences are listed in Table I. The proportions of the blocks with distinct local ranges are listed in Table II. From the results we can see that a considerable number of blocks have a low pixel range. This result implies that the depth value usually varies within a small range in a local region. The blocks with a large range are likely located along the occlusion boundaries. Even though the depth data has a large pixel value range around a boundary, it is usually composed of several smooth regions located in the foreground and background, and the pixel value range is small within each region. This data property implies that the depth values within a block are sparsely distributed and the pixels can be easily classified into several clusters. To investigate this, we use a constrained K-means to classify the pixels within one block and guarantee that the difference between the maximum and minimum depth values in each cluster is within a constraint. The clustering starts from one cluster. If the range of a cluster exceeds the constraint, we increase the cluster number until the constraint is met. The constraint is set in the form of 2^l - 1, and four constraint values, namely 255, 127, 63 and 31, are used to perform the investigation. Given each constraint value, we calculate the block proportion with the distinct cluster numbers based on 720000 blocks from the six depth sequences and present the result in Fig. 2. It is obvious that if the constraint is tightened with a smaller difference value, there will be more clusters within one block. Note that using four clusters can represent more than 70% of the blocks while bounding the quantization error within 31. If the high precision depth data has n bit depths and the highest m bit depths are extracted as one layer of depth data, the result in Fig. 2 can also reflect the clustering results of this depth layer with the maximum quantization error as 2^(l-(n-m)) - 1 correspondingly. This result indicates that high precision depth data has a clustering characteristic, which is more obvious for the depth data extracted from the high bit-depth levels.

Fig. 2. The block proportion result on different cluster numbers.

B. Proposed framework

With respect to this property of high precision depth data, we propose a layered depth compression framework that partitions the depth into two layers at the bit-depth level, the MSBs layer and the LSBs layer, each with a low pixel value range, to represent the high precision depth. The MSBs layer contains the general depth information and gives a coarse grained description of the objects' surfaces, while the LSBs layer records the remaining detailed depth information to describe the tiny variations on the surfaces. Distinct coding schemes are designed for each layer. The specific implementation of the proposed scheme is depicted in Fig. 3. Since depth sensors are limited in accuracy and capability, depth value capture often fails due to several limitations [20, 21], and the pixels in the affected regions are invalid and represented as zero in the depth map. To remove the sharp boundaries caused by the invalid pixels and facilitate block-based coding, the depth frame is first fed into the padding module, where the invalid depth pixels in each block are padded with the average of the valid pixel values within the block. Meanwhile, a binary mask map recording the invalid depth positions is generated as the side



Fig. 4. The evolution of bit-depth allocation between two layers. (The n-bit depth is split into MSBsini, the highest m bits, and LSBsini, the lowest n-m bits; after quantization, MSBsq is entropy coded, while the biased quantization error (QE+bias) occupies the highest 8-(n-m) bits of LSBsq, which is image/video coded.)

information, which will be entropy encoded and transmitted to the decoder for reconstruction. After padding, the high precision depth represented by an n-bit integer is partitioned into two layers at the bit-depth level: the MSBs layer and the LSBs layer, initialized by the highest m-bit and the remaining lowest (n-m)-bit integers respectively, as shown in Fig. 4. In the MSBs layer, the depth value distributes sparsely with a clustering property. Considering that the efficiency of transform-domain coding is inadequate for image compression with sharp edges, a pixel domain coding scheme is proposed in which the MSBs layer depth in one block is represented, through quantization, as several major depth values and the corresponding index map, and compressed by entropy coding. Since the MSBs layer contains the more significant depth information, the coding distortion is recorded and added back to the highest remaining bit-depths of the LSBs layer for further encoding, as shown in Fig. 4. In the LSBs layer, the depth map is smooth after the sharp edge information has been extracted into the MSBs layer, and the pixel values change in a continuous manner. Thus, a transform-domain coding scheme is employed for efficient compression. An adaptive quantization scheme with error control is proposed in the MSBs layer coding to guarantee that the LSBs layer depth data fed to the encoder is within 8 bits, so that existing optimized image/video codecs can be integrated directly. In the decoder, the MSBs/LSBs layer bitstreams are decoded respectively. Then the decoded two-layer depth data are merged into a high precision depth map. The reconstructed depth map is generated by merging the depth map with the mask map.

III. LAYERED DEPTH COMPRESSION

A. MSBs Layer Coding

The MSBs layer block can be expressed concisely by several major depth values and the corresponding index map through quantization, in which the pixels with close depth values are clustered together and one representative depth value is selected to approximate all the pixels in the cluster. Assume

Fig. 5. The depth frame No. 63 of "player". (a) Original depth map (scaled to 8-bit for display); (b) Mask map (valid pixels set to 0, invalid pixels set to 255); (c) MSBsini depth map with the highest 8 bit depths as an example; (d) The quantized MSBs layer depth map MSBsq; (e) Index map in the MSBs layer with five grey levels (escaped pixels set to 255); (f) LSBsq depth map; (g) Histogram of MSBsini; (h) Histogram of LSBsq.

that an MSBs block with N pixels can be classified into k clusters. In each cluster Sk, all the pixels are quantized to one major depth value MDk as:

MSBsq(x) = MDk, if x ∈ Sk,    (1)

and the quantization error QEMSBs can be represented as:

QEMSBs(x) = x - MDk, if x ∈ Sk.    (2)
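As a toy illustration of Eqns. (1)-(2) (the pixel values, clustering and major depths below are hypothetical, not taken from the paper):

```python
# Toy illustration of Eqns. (1)-(2): quantize each pixel of one MSBs block
# to its cluster's major depth and record the quantization error.
pixels   = [100, 102, 101, 180, 182]   # MSBs-layer values of one block
clusters = [0, 0, 0, 1, 1]             # cluster index S_k of each pixel
major_depth = {0: 101, 1: 181}         # MD_k chosen for each cluster

msbs_q = [major_depth[k] for k in clusters]                  # Eqn. (1)
qe = [x - major_depth[k] for x, k in zip(pixels, clusters)]  # Eqn. (2)
# msbs_q == [101, 101, 101, 181, 181]; qe == [-1, 1, 0, -1, 1]
```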

The index map indicates which major depth is used by each pixel. Then the quantized depth map MSBsq can be represented as the major depths and the index map, which are compressed in a lossless way, while the quantization error QEMSBs is fed to the LSBs layer. The related solutions for quantization and compression in the pixel domain can be summarized as follows. Given a cluster number k, quantization methods such as vector quantization [22], dynamic programming [23] and K-means [24] are used to minimize the quantization error, and the bitrate cost of the index map coding is calculated by performing the lossless entropy coding once. Given the cluster number range [1, Kmax], the optimal k with the minimum rate-distortion cost



Fig. 6. An example of major depth extraction and adaptive quantization (MD: major depth; QW: quantization window). The histogram of MSBsini values shows three major depths MD0-MD2 with quantization windows QW0-QW2 and a few escape pixels.

will be chosen. However, there are two main disadvantages in these solutions. First, the complexity is too high to achieve real-time compression: the traditional quantization scheme and the re-encoding operation needed to obtain an accurate bitrate cost both introduce significant computation cost. Second, these solutions cannot guarantee that the quantization error is within a certain range. In this scheme, a histogram based quantization scheme is proposed to address these problems.

1) Histogram Based Quantization: In the proposed scheme, the major depth value of each cluster and the corresponding clustering range are determined based on the histogram of the MSBs layer block. In order to reduce the quantization distortion, the peak value in the histogram is selected as one major depth each time, shown as the thick red line in Fig. 6. Given the selected major depth, the neighboring depth values in the quantization window are quantized to the major depth. Then the pixel counts of these depth values, including the major depth, are set to zero in the histogram to indicate that these depth values have been quantized. The next major depth is then selected from the updated histogram, and this process is repeated until Kmax major depths are selected or all the non-zero depth values in the histogram are clustered. In the proposed scheme, the maximum major depth number Kmax is set to four in terms of the data property analyzed in Section II. Given one major depth, the neighboring depth values in the quantization window are quantized to the major depth. To ensure that the pixel value range of LSBsq is within 8 bits, the window size should be bounded. Assuming that the depth data has n bits and the MSBs layer has m bits, there will be n-m bits initially fed to the LSBs layer. Then the remaining highest 8-(n-m) bits in the LSBs layer can be filled by the quantization error from the MSBs layer, as shown in Fig. 4.
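A minimal sketch of this histogram based quantization, assuming one MSBs-layer block given as a flat list of pixel values (the function and parameter names are ours; the real scheme additionally emits the index map and window parameters for the bitstream):

```python
from collections import Counter

def histogram_quantize(block, k_max=4, qstep_max=7):
    """Sketch of the histogram based quantization: repeatedly pick the
    histogram peak as a major depth, grow its quantization window on each
    side until the step bound or a zero (discontinuity) bin is reached,
    and mark pixels outside every window as escaped."""
    hist = Counter(block)
    majors, windows = [], []
    while hist and len(majors) < k_max:
        md = max(hist, key=hist.get)               # peak = next major depth
        lo = hi = md
        while md - lo < qstep_max and hist.get(lo - 1, 0) > 0:
            lo -= 1                                # expand left side
        while hi - md < qstep_max and hist.get(hi + 1, 0) > 0:
            hi += 1                                # expand right side
        majors.append(md)
        windows.append((lo, hi))
        for v in range(lo, hi + 1):                # zero out clustered bins
            hist.pop(v, None)
    quantized, escaped = [], []
    for x in block:
        for md, (lo, hi) in zip(majors, windows):
            if lo <= x <= hi:
                quantized.append(md)               # quantized to major depth
                break
        else:
            quantized.append(x)                    # escaped pixel, kept lossless
            escaped.append(x)
    return quantized, majors, escaped
```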
Considering that the quantization error may be negative, a bias is added to all the errors to guarantee a non-negative representation. To guarantee that the bit-depth of the biased quantization error is no larger than 8-(n-m), the value of the biased quantization error must be within the range [0, 2^(8-(n-m)) - 1], and the maximum quantization window size should be 2^(8-(n-m)) - 1. The bound of the quantization step size is equal to half of the maximum quantization window size:

Qstepmax = ⌊(2^(8-(n-m)) - 1) / 2⌋    (3)
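As a quick numeric check of Eqn. (3) and the bit budget behind it (the helper names are ours):

```python
def free_bits(n, m):
    # Bits left free in the LSBs layer after its initial (n-m)-bit data.
    return 8 - (n - m)

def qstep_max(n, m):
    # Eqn. (3): bound on the quantization step size so that the biased
    # error fits into the free bits of the 8-bit LSBs layer.
    return (2 ** free_bits(n, m) - 1) // 2

# n = 12, m = 6: 2 free bits, window size <= 3,  Qstepmax = 1
# n = 12, m = 8: 4 free bits, window size <= 15, Qstepmax = 7
```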

The bias is equal to the quantization step size bound for nonnegative LSBsq generation. Note that given the data

Fig. 7. An example of the three modes of index map encoding (horizontal line-mode, vertical line-mode and prediction pattern mode), showing a 16x16 index map and the corresponding prediction index map.

format and bit allocation scheme between the two layers, the quantization step size bound and the bias are constants which are maintained in both encoder and decoder without transmission. Under the quantization bound, the quantization window size is adapted for each major depth to balance the depth value allocation between the two layers. Considering that the histogram distribution of depth values can offer useful hints to detect the depth discontinuity within the block, the quantization window size for each major depth value is adapted as follows: the quantization step sizes on both sides increase from the major depth position in the histogram until reaching the maximum size or meeting a discontinuity position, which is indicated by a zero in the histogram as shown in Fig. 6. In this way, the pixels in the same object are most likely quantized to the same major depth. This is beneficial for obtaining a smooth quantization error map to add to the LSBs layer for compression. Given the maximum major depth number, there might also be a small number of pixels that escape from the quantization windows. These escaped pixels are recorded for lossless coding to keep the distortion as small as possible.

2) Pixel Domain Encoding: After quantization, the MSBs block data can be represented as the major depths and the corresponding index map. The index values range from 0 to K, in which K is equal to the maximum major depth number Kmax. Index values 0 to K-1 denote the major depths, and K stands for escaped pixels. In the proposed scheme, the major depths are predicted from the previously coded major depths and the prediction differences are entropy encoded. Since the index value K only indicates that the current pixel is an escaped pixel, the pixel value itself must be encoded. To remove the redundancy, each escaped pixel is also predicted from the previous escaped pixel. We use one bit to indicate the prediction result.
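This escaped pixel coding, with the residual-zero flag elaborated in the following paragraph, might be sketched as below (a sketch; the initial predictor of zero and the function name are our assumptions):

```python
def encode_escaped(escaped_pixels, m):
    """Sketch of the escaped pixel coding: each escaped pixel is predicted
    from the previous one; a 1-bit flag signals a zero residual, otherwise
    the pixel value is fixed-length coded with m bits (the MSBs layer
    bit number)."""
    bits = []
    prev = 0                                   # assumed initial predictor
    for p in escaped_pixels:
        if p == prev:
            bits.append("1")                   # flag: residual is zero
        else:
            bits.append("0" + format(p, f"0{m}b"))  # flag + fixed-length code
        prev = p
    return "".join(bits)
```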
If the residual is equal to zero, only the prediction flag is written into the bitstream and the escaped pixel is reconstructed from the previous one. Otherwise, the escaped pixels are fixed-length coded and the bit length of the symbol is equal to the bit number of the MSBs layer. Since the index map consumes a dominant share of the bits, three coding modes are employed to fully exploit the spatial correlation within the block: horizontal line-mode, vertical line-mode and prediction pattern mode, shown in Fig. 7. If the index map of one line has the same index value, the horizontal line-mode is used to encode the indices in this line; only the mode flag and the index value are written into the bitstream. If one line has the same indices as the previous one, the



vertical line-mode will be used, so that only the line mode flag is written into the bitstream. When the depth value changes frequently, the indices are encoded pixel-by-pixel in the prediction pattern mode. We use the indices of the left and top pixels as predictions to exploit the two-dimensional correlation. If the current pixel is the same as the top one, the prediction index is set to 0. If it is the same as the left one, the prediction index is set to 1; otherwise the prediction index is set to 2. To utilize the global pattern correlation, we employ a pixel-group pattern coding method in which every j prediction indices in raster scanning order are grouped to form a pixel-group symbol that is entropy coded. During the index scanning, if the prediction index is equal to 2, the original index is written into the bitstream. The entropy rate decreases when the pixel-group size increases. However, the price of a lower entropy rate is increased decoding complexity when variable length coding is employed. For instance, when the pixel-group size is 8, the number of different pixel-group prediction symbols is 3^8 = 6561, which makes fast Huffman decoding very difficult to implement. So in our implementation the pixel-group size is set to four to achieve a trade-off between coding efficiency and decoding complexity.

B. LSBs Layer Encoding

In the proposed scheme, after the most significant depth information is extracted from the original depth, the sharp edge features are preserved in the MSBs layer, while the generated LSBs layer exhibits depth variation with a smaller range than the original depth. Therefore, a transform-domain image/video coding scheme is employed to further exploit the correlation in the spatial and temporal domains. Due to the error control in the MSBs layer coding, the pixel value range of LSBsq is within 0~255. Most state-of-the-art 8-bit codecs can be integrated into the proposed coding structure, such as JPEG, MPEG-2, H.264 and HEVC.
The integrated codecs can be chosen in terms of the specified system requirements for coding efficiency and complexity.

IV. BIT-DEPTH LEVEL PARTITION

The bit-depth level partition between the two layers determines the distribution of depth information and the coding performance of each layer. In order to optimize the coding efficiency, a frame-level rate-distortion model is derived to assist with the bit-depth partition between the two layers. Given that the quantized MSBs layer is encoded in the pixel domain without distortion and the LSBs layer is compressed lossily, the bit-depth partition problem is a rate-distortion optimization problem:

J(m|QP, λ) = λ(RMSBs(m) + RLSBs(m|QP)) + DLSBs(m|QP)    (4)

where m is the number of bits partitioned into the MSBs layer, QP is the quantization parameter of the LSBs layer coding, and λ is the Lagrangian multiplier. We need to find the mopt that minimizes the rate-distortion cost J(mopt|QP, λ). Depth coding efficiency is highly related to inherent depth characteristics. That is, complex depth texture may lead

Fig. 8. Investigation result on MSBs layer data characteristics.

to an increase in both bitrate and distortion. In the proposed scheme, the high gradient information is preserved in the MSBs layer and compressed in the pixel domain, while the remaining smooth region data is fed to the LSBs layer for compression. Therefore, how much high gradient information is preserved in the MSBs layer directly influences the coding performance of the two layers. To evaluate the amount of high gradient information, we count the number of high gradient pixels within one block. Since two nonadjacent values in a histogram of the MSBs are considered as two distinct objects, the current pixel is treated as a high gradient pixel if the difference between the current pixel and its left or top neighboring pixel is beyond an adaptive threshold 2^(n-m). In terms of this rule, the high gradient pixel number within one block, HGm, is calculated. We randomly select 4800 depth blocks from the six depth sequences in Fig. 1 and investigate the relationship between the high gradient pixel number and the MSBs bit number, as shown in Fig. 8. When more bits are allocated to the MSBs layer, the depth map in the MSBs layer will be more precise and the high gradient pixel number will increase proportionally. Since the MSBs layer can be represented as major depths, an index map and escaped pixels, the differential pixel number DFm in the index map and the escaped pixel number NEscapes are introduced to reflect the texture complexity of the MSBs layer. In the index map, a pixel is counted as a differential pixel if it is different from its left or top neighboring pixel. The numeric relationships among the differential pixel number, escaped pixel number and MSBs bit number are depicted in Fig. 8. We can observe that both DFm and NEscapes increase monotonically with the MSBs bit number. However, the magnitude of the increase changes when the MSBs bit number is set to different values.
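The HGm statistic can be sketched as below (a sketch with hypothetical names; whether "beyond" means a strict or non-strict comparison with 2^(n-m) is our assumption, marked in the code):

```python
def high_gradient_count(block, w, h, n, m):
    """Sketch of the HG_m statistic: count pixels whose difference from the
    left or top neighbor reaches the adaptive threshold 2**(n-m)."""
    thr = 2 ** (n - m)
    count = 0
    for y in range(h):
        for x in range(w):
            cur = block[y * w + x]
            left = block[y * w + x - 1] if x > 0 else cur
            top = block[(y - 1) * w + x] if y > 0 else cur
            # ">=" is our reading of "beyond the threshold".
            if abs(cur - left) >= thr or abs(cur - top) >= thr:
                count += 1
    return count
```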
When m ≤ 8, the magnitude of the increase is stable for both DF_m and N_Escapes, and the increase speed is similar to that of the high gradient pixel number. When m > 8, the increase of DF_m is insignificant and it remains almost constant, while N_Escapes still increases with the MSBs bit number. The differential pixel number and the escaped pixel number can thus also be characterized by the same local spatial feature as the high gradient pixel number. Based on the investigation results obtained from the 4800 depth blocks, we build up the bitrate

1057-7149 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


model in the MSBs layer by approximately representing DF_m and N_Escapes in terms of HG_m as follows:

DF_m = { μ·HG_m,  m ≤ 8
         μ·HG_8,  m > 8        (5)

and

N_Escapes = { ν·HG_m,       m ≤ 8
              τ·HG_m + ω,   m > 8        (6)

These relationships are fitted by the least squares method. To achieve a high fitting correlation with a simple relationship, piecewise linear fitting is employed, in which the fitting is performed in two pieces partitioned at MSBs bit number 8. The fitting errors are 7.97% and 8.82% for Eqn. (5) and (6), respectively (μ = 1.62, ν = 1.07, τ = 0.91 and ω = 37.26). The two-stage models in Eqn. (5) and (6) can be easily explained in terms of the spatial properties of the MSBs. The higher the bit number set for the MSBs, the more spatial detail is preserved in the MSBs layer. Accordingly, a more complicated index map and more escaped pixels will be generated. As the maximum major depth number is fixed, the differential pixel number will increase at first and reach an upper bound as the bit number in the MSBs layer increases. The escaped pixel number will keep increasing to represent the growing set of pixels that cannot be quantized as index pixels. The bitrate in the MSBs layer is composed of three parts:

R_MSBs = R_MD + R_Index + R_Escapes        (7)
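The m ≤ 8 piece of Eqn. (5) can be fitted as sketched below. This is a hypothetical illustration: the data layout (parallel per-block measurement arrays) and the origin-constrained closed-form slope for the m ≤ 8 piece are our assumptions, standing in for the paper's least squares fitting.

```python
import numpy as np

def fit_piecewise(hg, df, m_values, split=8):
    """Piecewise fit of DF_m against HG_m, split at MSBs bit number 8.

    hg, df, m_values are parallel arrays of measurements
    (hypothetical data layout, for illustration only).
    """
    low = m_values <= split          # m <= 8: DF_m ~ mu * HG_m
    # Least-squares slope of a line through the origin.
    mu = np.dot(hg[low], df[low]) / np.dot(hg[low], hg[low])
    high = ~low                      # m > 8: DF_m ~ constant (mu * HG_8)
    const = df[high].mean() if high.any() else mu * hg[low].max()
    return mu, const
```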

where R_MD, R_Index and R_Escapes represent the bitrate of the major depths, the index map and the escaped pixels, respectively. Since the major depth bitrate cost is limited and nearly fixed, the bitrate cost in the MSBs layer is mainly caused by index map and escaped pixel coding. For index map coding, the bitrate of the index map is determined by the index map complexity. Based on the investigation results shown in Fig. 9(a), which are also obtained from the 4800 depth blocks, the index map bit number R_Index varies almost linearly with the differential pixel number DF_m under different MSBs bit numbers. For escaped pixel coding, the bitrate cost varies linearly with N_Escapes as shown in Fig. 9(b). Based on the above analysis, the bitrate of the MSBs layer can be represented as:

R_MSBs(m) = α·DF_m + β·N_Escapes + m·c_0        (8)

Substituting Eqn. (5) and (6) into (8), the bitrate of the MSBs layer can be represented as:

R_MSBs(m) = a_0·HG_m + m·c_0        (m ≤ 8)        (9)

and

R_MSBs(m) = a_1·HG_m + b_1 + m·c_0        (m > 8)        (10)

where a_0 = αμ + βν, a_1 = βτ, and b_1 = αμ·HG_8 + βω.

In the LSBs layer, the bitrate cost can be estimated based on the model in [25], which is applicable to many typical transform-domain coding systems. In this model, the coding bitrate R and the distortion D can be represented as functions of the quantization step size as follows:

R_LSBs(m|QP) = θ_m / Q_step        (11)

and

D_LSBs(m|QP) = ρ_m · Q_step        (12)

Fig. 9. Fitting curves of the bitrate cost model in the MSBs layer. The fitting correlations are 0.9972 and 0.9981.

Fig. 10. Fitting curves of the rate distortion model in the LSBs layer. The fitting correlations are 0.9742 and 0.9819.

where θ_m and ρ_m are constants related to the input data. After the extraction of the high gradient information into the MSBs layer, the preserved smooth depth data with quantization errors is fed to the LSBs layer. The additional error still destroys the original data correlation to some degree. The impact on data correlation in the LSBs layer can be denoted as Gl_m and estimated as follows:

Gl_m = (N − HG_m) · 2^(n−m) / (2N)        (13)

where N is the pixel number within one block. The impact on the LSBs layer is mainly determined by two factors: the number of pixels with quantization error, and the quantization step size. Since the high contrast information is fully recorded in the high gradient pixels of the MSBs layer and the quantization error is mainly located in the other pixels, the number of pixels with quantization error is the difference between the total pixel number and the high gradient pixel number. Intuitively, the parameters θ_m and ρ_m will increase with the impact value Gl_m. To investigate this relationship, we measure the variations of θ_m and ρ_m under different pixel numbers in the MSBs layer using the 4800 depth blocks and calculate the average value of Gl_m. The relationship is shown in Fig. 10. The curves of θ_m versus Gl_m and ρ_m versus Gl_m can be fitted as follows:

θ_m = a_l · ln(Gl_m) + b_l        (14)

and

ρ_m = c_l · ln(Gl_m) + d_l        (15)


Fig. 11. The bitrate cost under different bit-depth level partitions between the two layers.

Fig. 12. Quantization performance evaluation based on rate-distortion results for the sequences "Movcam", "Player_one", "Talking" and "Dude" (PSNR in dB versus Kbits/frame, reference scheme versus proposed scheme).

Therefore, the overall rate-distortion cost of the two layers can be represented as a function of the high gradient pixel number HG_m as follows:

J(m) = λ · ( a·HG_m + (a_l · ln((N − HG_m)·2^(n−m)/(2N)) + b_l) / Q_step + m·c_0 )
       + (c_l · ln((N − HG_m)·2^(n−m)/(2N)) + d_l) · Q_step        (16)
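Given per-frame measurements of HG_m, the cost of Eqn. (16) can be evaluated for each candidate partition and minimized directly. A minimal sketch follows; all model constants (a, c_0, a_l, b_l, c_l, d_l) are hypothetical placeholders that would in practice come from the fitting described above.

```python
import math

def rd_cost(m, hg_m, qstep, lam, n=12, N=256,
            a=1.0, c0=0.02, al=1.0, bl=5.0, cl=0.5, dl=1.0):
    """Evaluate the rate-distortion cost J(m) of Eqn. (16).

    The model constants are illustrative placeholders, not the
    paper's fitted values.
    """
    gl = (N - hg_m) * (2 ** (n - m)) / (2 * N)      # Eqn. (13)
    rate = a * hg_m + (al * math.log(gl) + bl) / qstep + m * c0
    dist = (cl * math.log(gl) + dl) * qstep
    return lam * rate + dist

def best_partition(hg_by_m, qstep, lam, n=12, N=256):
    """Pick m_opt minimizing J(m), given HG_m measured per candidate m."""
    return min(hg_by_m, key=lambda m: rd_cost(m, hg_by_m[m], qstep, lam, n, N))
```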

Next, we perform the bit-depth partition optimization based on the proposed rate distortion model. We employ H.264 as an example of the transform-domain coding scheme to conduct the LSBs layer coding. The parameter λ is set as the traditional Lagrangian multiplier in H.264. We calculate the high gradient pixel number HG_m in each frame and estimate the rate distortion cost. Then the average rate distortion cost of the whole depth sequence is calculated under different bit numbers in the MSBs layer. The estimated rate distortion cost under high quality (QP=16) in the LSBs layer is shown in Fig. 11, where the y-axis value is the rate distortion cost per pixel. We can observe that the rate distortion cost is minimal when 8 bits are allocated to the MSBs layer for Kinect depth data, and when 7 bits are allocated to the MSBs layer for stereo depth data. To verify this observation, we perform compression under different bit partitions. The RD cost based on the simulation is also shown in Fig. 11. We can see that the simulation result matches the estimated one. The same conclusion can be obtained with low coding quality in the LSBs layer based on both the estimated and the simulation results.

V. EXPERIMENTAL RESULTS

In order to evaluate the effectiveness of the proposed coding scheme, we apply it to high precision depth compression. We employ two data sets. One is Kinect depth data captured at 30 fps with a resolution of 640x480. The other is stereo depth data captured by a stereo depth camera at 30 fps with a resolution of 800x480. The bit depths in the two data sets are both 12 bits, with a pixel value range from 0 to 4095. Six depth sequences are selected from the two data sets as shown in Fig.

TABLE III
QUANTIZATION ERROR IN MSBS LAYER MEASURED BY MSE VALUE

Sequence     Kmeans   Proposed   Error increase (%)
MovCam       0.26     0.38       46.15
Player one   0.32     0.39       21.88
Talking      0.34     0.49       44.12
Dude         0.16     0.22       37.50

Error increase = (Proposed − Kmeans)/Kmeans × 100%

1. In the experiment, the evaluation primarily covers three aspects: 1) evaluation of the quantization in the MSBs layer; 2) overall coding performance; 3) reconstructed depth quality in non-viewing applications. We first introduce the detailed implementation of the proposed scheme. In the MSBs layer, two entropy coding methods are implemented: arithmetic coding and variable length coding (VLC). In the LSBs layer, one image codec, JPEG [26], and two video codecs (x264 [27] and HEVC HM2.0 [28]) are integrated for LSBs layer compression. The coding configuration of x264 is GOP structure IPPP, GOP size 16, one reference frame and CABAC entropy coding. The coding configuration of HEVC is the default low-delay setting with GOP structure IPPP, four reference frames and CABAC entropy coding. The binary mask map generated in the padding module is encoded by a JBIG encoder [29]. All coding times are measured on a PC with an Intel Xeon 2.27 GHz processor and 8 GB of memory. In the experiment, the peak signal-to-noise ratio (PSNR) is used to measure the objective quality of the decoded depth as:

PSNR = 10 · log10( Peak^2 / MSE )        (17)

where MSE denotes the mean square error between the reference depth and the reconstructed depth, and Peak is the peak value, equal to (2^12 − 1) due to the large pixel value range.
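The PSNR of Eqn. (17) with the 12-bit peak value can be computed as follows (a minimal sketch; the function name is ours):

```python
import math
import numpy as np

def depth_psnr(ref, rec, bit_depth=12):
    """PSNR of Eqn. (17) with Peak = 2**bit_depth - 1 (4095 for 12-bit depth)."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    if mse == 0:
        return math.inf
    peak = (1 << bit_depth) - 1
    return 10.0 * math.log10(peak ** 2 / mse)
```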

A. Evaluation on Quantization in MSBs Layer In this part, we will evaluate the efficiency of the proposed quantization scheme. A reference scheme based on K-means is


Fig. 13. Rate-distortion performance comparison of schemes based on image codecs.

Fig. 14. Coding performance results for the depth sequence "Player". (a) The MSBs layer coding distortion and bitrate; (b) the relationship between the bitrate cost of image based coding and the bit-depth number.

introduced in the MSBs layer, in which, given a cluster number k, the quantization is performed by K-means. To guarantee the 8-bit data format in the LSBs layer, if the quantization error of a pixel is beyond the maximum error bound, this pixel is set as an escaped pixel without quantization. For each depth block, the cluster number is searched from 1 to K_max, which is also equal to four, and the cluster number with the minimal quantization error is selected. The quantization error measured by the mean square error (MSE) is compared in Table III. From the result, we can observe that the K-means scheme achieves a better clustering result, but the gap between the proposed scheme and the reference scheme is not large. We also compare the quantization schemes in terms of the overall rate distortion result. After quantization, the MSBs layer is entropy encoded by arithmetic coding and the LSBs layer is compressed by the x264 encoder. The rate distortion comparison is shown in Fig. 12. We can observe that the proposed scheme saves about 10% bitrate on average at high bitrates. Although the K-means scheme achieves less quantization error and facilitates the MSBs layer coding, the LSBs layer generated in the proposed scheme is smoother than in the reference scheme. The lower bitrate cost in the LSBs layer of the proposed scheme leads to the overall coding performance improvement.

B. Evaluation on Overall Coding Performance

To further evaluate the overall coding performance, the image codec JPEG and the video codecs x264 and HEVC (HM2.0) are integrated into the LSBs layer compression. In the image
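The reference K-means quantizer can be sketched as follows. This is an illustrative 1-D Lloyd iteration with a simple initialization; the error bound 2^(n−m), the selection rule (fewest escapes, then minimal MSE over non-escaped pixels) and the iteration count are our assumptions, not the paper's exact implementation.

```python
import numpy as np

def kmeans_quantize_block(block, n=12, m=8, k_max=4, iters=10):
    """Reference MSBs quantizer based on 1-D k-means (illustrative).

    For each candidate cluster number k in 1..k_max, cluster the pixel
    values with a few Lloyd iterations, mark pixels whose quantization
    error exceeds the bound 2^(n-m) as escaped (stored unquantized),
    and keep the k with the fewest escapes, breaking ties by the MSE
    over the non-escaped pixels.
    """
    vals = block.astype(np.float64).ravel()
    bound = 1 << (n - m)
    best = None
    for k in range(1, k_max + 1):
        centers = np.linspace(vals.min(), vals.max(), k)   # simple init
        for _ in range(iters):                             # Lloyd iterations
            idx = np.argmin(np.abs(vals[:, None] - centers[None, :]), axis=1)
            for j in range(k):
                if np.any(idx == j):
                    centers[j] = vals[idx == j].mean()
        idx = np.argmin(np.abs(vals[:, None] - centers[None, :]), axis=1)
        err = np.abs(vals - centers[idx])
        kept = err <= bound
        mse = float(np.mean(err[kept] ** 2)) if kept.any() else float("inf")
        score = (int((~kept).sum()), mse)
        if best is None or score < best[0]:
            best = (score, centers, ~kept)
    return best
```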

based coding scheme, variable length coding is employed for the MSBs layer, while in the video based coding scheme both variable length coding and arithmetic coding are implemented. Furthermore, the extended versions of JPEG and HEVC for 12-bit data compression are also selected as reference coding schemes, denoted as JPEG extension and HEVC extension. To verify the efficiency of the proposed MSBs layer encoding, layered JPEG and layered HEVC are also implemented for comparison. In these layered coding schemes, the depth map is likewise partitioned into the MSBs and LSBs layers. The difference in implementation is that the MSBs layer data MSBs_ini is compressed in high quality using HEVC (QP=20) and JPEG (QP=90), and the MSBs layer coding distortion is then added to the LSBs layer. Since accurate error control can hardly be achieved by the transform-domain coding approach, the pixel value range in the LSBs layer is very likely higher than 255, so the 12-bit HEVC extension and JPEG extension codecs are employed. The rate distortion comparison results among the JPEG related schemes are shown in Fig. 13. The layered JPEG performs worst among all schemes, because the depth partition actually destroys the spatial correlation to some degree. Notice that the coding performance of the proposed scheme is much better than JPEG extension. That is because the proposed scheme can reconstruct the MSBs layer in high quality with a nearly fixed bitrate cost and keeps the bitrate cost of the LSBs layer within a range. In contrast, the bitrate cost of JPEG extension increases significantly when the bit-depth number increases. To further investigate the coding gain, we show some rate distortion results for the test sequence "Player" in Fig. 14. In Fig. 14(a), the coding performance in the MSBs layer is evaluated by calculating the MSE of the MSBs layer depth data under different overall bitrates corresponding to different QPs in the LSBs layer.
We can observe that the distortion in the MSBs layer stays within a small range in the proposed coding scheme, because the quantized depth map in the MSBs layer is compressed without distortion. Moreover, the quantization error of the MSBs layer is fed to the highest bit-depth of the LSBs layer, and the lossy coding in the LSBs layer introduces limited distortion on the high bit-depth value. In Fig. 14(b), the bitrate costs under different bit-depth numbers are shown. We scale the 12-bit sequence to 8-11 bit sequences respectively. The 8-bit depth sequence is encoded


Fig. 15. Rate-distortion performance comparison of schemes based on video codecs.

by JPEG directly. The depth sequences with more than 8 bits are encoded by both JPEG extension and the proposed scheme. In the proposed scheme, the highest 8 bits are partitioned to the MSBs layer while the remaining bits are sent to the LSBs layer. We control the coding distortion to be approximately the same (MSE=6.5) and compare the bitrate cost. Due to the lossless coding in the pixel domain, the bitrate cost in the MSBs layer is almost fixed within a certain range. In the LSBs layer, the bitrate cost is also bounded in a small range since the data range does not exceed an 8-bit integer. In contrast, the bitrate cost of JPEG extension increases rapidly with the bit-depth number. Thus, the overall bitrate cost is significantly reduced compared with the reference scheme. The RD performance results of the video based coding schemes are shown in Fig. 15. Similar to the image based coding results, the layered HEVC performs worst because the spatial and temporal correlations are destroyed. Since the HEVC codec is more efficient than H.264 for video coding, and arithmetic coding is more efficient than VLC coding, the combination "Ours HEVC arithmetic" has the best coding performance among the proposed schemes, followed by "Ours x264 arithmetic" and "Ours x264 vlc". For the Kinect data set, the coding efficiency of the proposed approach with x264 is comparable with the HEVC extension. The proposed scheme with HEVC outperforms HEVC extension and can save 9.2% on average at high bitrates. The reasons for the performance improvement are mainly twofold. First, the transform-domain video codec cannot fully exploit the correlation around the complicated and irregular edges in the depth map, whereas, as a non-transform based coding scheme, the pixel domain coding in the MSBs layer can efficiently compress depth blocks with sharp edge information.
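The bit-depth partition itself reduces to shifts and masks. A minimal sketch follows, ignoring the quantization-error feedback between layers described earlier; the function names are ours.

```python
import numpy as np

def split_layers(depth, n=12, m=8):
    """Split an n-bit depth map into an m-bit MSBs layer and an
    (n-m)-bit LSBs layer via shifts and masks."""
    depth = depth.astype(np.uint16)
    msbs = depth >> (n - m)
    lsbs = depth & ((1 << (n - m)) - 1)
    return msbs, lsbs

def merge_layers(msbs, lsbs, n=12, m=8):
    """Reassemble the full-precision depth map from the two layers."""
    return (msbs.astype(np.uint16) << (n - m)) | lsbs
```

With n=12 and m=8, each pixel splits into an 8-bit coarse value and a 4-bit remainder, both of which fit into standard 8-bit codec inputs.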
Second, the Kinect depth contains noise and holes in the temporal and spatial domains, which destroys the data correlation [12] and reduces the coding performance of the video codec to some degree. In contrast, the padding module in the proposed framework

TABLE V
COMPLEXITY COMPARISON OF SCHEMES BASED ON IMAGE CODECS (MS/FRAME)

              Ours JPEG        JPEG             Ratio
Sequence      Enc.    Dec.     Enc.    Dec.     Enc.   Dec.
Cubicle       23.48   23.34    24.46   21.56    1.04   0.92
Movcam        22.95   19.26    23.99   20.65    1.04   1.07
Player one    23.91   21.35    24.66   21.51    1.03   1.01
Talking       26.70   22.87    27.30   23.14    1.02   1.01
Dude          22.53   18.00    23.51   17.16    1.04   0.95
Laurana       20.96   16.14    20.53   15.32    0.98   0.95

Ratio = Ref./Ours

TABLE VI
COMPLEXITY OF THE PROPOSED SCHEME BASED ON VIDEO CODEC (MS/FRAME)

              Ours x264 Vlc    Ours x264 Arithmetic
Sequence      Enc.    Dec.     Enc.    Dec.
Cubicle       27.71   23.34    34.05   19.73
Movcam        29.23   20.56    33.86   17.47
Player one    28.09   21.87    34.08   17.97
Talking       30.45   22.64    35.31   19.09
Dude          25.86   15.19    30.54   14.16
Laurana       23.21   13.53    26.23   13.30

can avoid the sharp edges caused by failed depth capture and make the data friendlier for compression. Moreover, the temporal correlation is also substantially exploited by the video codec in the LSBs layer. For the depth sequences "Dude" and "Laurana" captured by the stereo camera, the depth data quality is better than the Kinect data, with less noise and fewer holes. The pixel value range of the overall map is limited and fewer sharp edges exist among objects. Thus, the coding performance of HEVC is close to or better than that of the proposed scheme with HEVC in the LSBs layer. For the "Dude" sequence, the HEVC scheme can save about 12% bitrate at high bitrates. We also show the bitrate of each layer in Table IV, in which the MSBs layer result of the proposed scheme is based on arithmetic coding. In the experiment, we control the coding


TABLE IV
THE BITRATE COST OF EACH LAYER COMPARISON (BITRATE: KBITS/FRAME, PSNR: DB)

              HEVC Layered               Ours HEVC                  Ours x264
Sequence      MSBs    LSBs    PSNR       MSBs    LSBs    PSNR       MSBs    LSBs    PSNR
MovCam        50.17   387.06  67.34      142.85  106.90  67.66      142.85  163.45  67.62
Player one    50.82   373.73  67.24      173.13  89.41   67.57      173.13  140.66  67.32
Talking       69.59   222.06  60.63      176.98  32.48   60.32      176.98  50.57   60.40
Dude          40.12   135.50  72.55      93.75   52.23   72.61      93.75   67.13   72.34

Fig. 17. Four gestures designed in the gesture recognition. From left to right: Up, Down, Go, Back.

Fig. 16. Coding performance comparison over the synthesized view of the "Cubicle" sequence. Partial regions in the synthesized frames from (a) uncompressed depth images, (b) depth images decoded by the proposed scheme (371 Kbits/frame, 36.32 dB), (c) depth images decoded by HEVC (377 Kbits/frame, 35.71 dB).

distortion of the overall 12-bit depth maps to be approximately the same and compare the bitrate of each layer. From the results, we can observe that the bitrate of the MSBs layer in the layered HEVC scheme is smaller than in the proposed scheme, since the HEVC codec can exploit the MSBs layer data correlation efficiently in both temporal and spatial domains without considering error control. However, the bitrate of the LSBs layer in the layered HEVC scheme is significantly larger than in the proposed scheme. That is because the LSBs layer data generated in the layered HEVC scheme exceeds the range of 255 and must be compressed by the 12-bit HEVC extension. In contrast, the LSBs layer data in the proposed scheme is within 8 bits. Moreover, the quantization scheme in the MSBs layer helps generate a smooth LSBs layer for compression. The lower bitrate cost in the LSBs layer of the proposed scheme leads to the overall coding performance improvement. Table V shows the coding complexity comparison based on the image codec. We can observe that the encoding and decoding times of the proposed scheme are comparable with the JPEG extension at a low complexity level. Table VI shows the encoding and decoding times of the proposed coding scheme based on the video codecs. We can observe that a real-time coding frame rate can be achieved when x264 is integrated in the LSBs layer. That is because complexity is fully considered in the proposed scheme. In the MSBs layer, the complexity of the quantization and entropy coding is kept at a low level. In the LSBs layer, an 8-bit optimized video codec can be integrated and implemented in real time. Moreover, the complexity of the preprocessing, such as padding and mask coding, is also maintained at a low level. In contrast, real-time coding can hardly be achieved by the standard video codec, since so far there are few optimized codec implementations for data formats of more than 8-bit depth.

Fig. 18. One intermediate result generated from the reconstructed depth frame used for gesture recognition. Left: generated frame from the proposed coding scheme. Right: generated frame from HEVC.

C. Evaluation on Non-viewing Applications

Considering that depth data is not directly used for viewing in many applications, we also evaluate the reconstructed depth quality in terms of several non-viewing applications. First, we utilize the depth information to render a synthesized view. The sequence "Cubicle" captured by Kinect is used for the view synthesis with the camera calibration information. We shift the view point along a line by using DIBR [30], implemented as in [31] with hole-filling post processing. We first synthesize the new view from the original texture and depth sequences and use this synthesized view as the reference for the PSNR calculation. Then the synthesized view is rendered from the decoded depth of the proposed coding scheme and of the HEVC scheme, respectively. In the proposed scheme, arithmetic coding is used in the MSBs layer and x264 in the LSBs layer. We control the bitrate of the depth coding to be similar and compare the synthesized view quality. For "Cubicle", the bitrate of the depth sequence is 371 Kbits/frame for the proposed scheme and 377 Kbits/frame for HEVC. The PSNR of the synthesized view sequence is 36.32 dB for the proposed scheme and 35.71 dB for HEVC. The performance gain arises mainly because the sharp edges in the depth map are well preserved by the pixel domain coding in the MSBs layer of the proposed scheme. The visual quality comparison is shown in Fig. 16, in which we can see that the synthesized view from the proposed scheme is similar to the result from the original depth, with clear boundaries among the objects. We also investigate the impact of depth compression error


TABLE VII
GESTURE RECOGNITION RESULTS BASED ON THE COMPRESSED DEPTH DATA

                 Original   Ours JPEG   Ours JPEG   Ours JPEG   HEVC       HEVC      HEVC
                            (QP=90)     (QP=70)     (QP=50)     (QP=-10)   (QP=-5)   (QP=0)
Bitrate (Mbps)   -          21.34       14.29       12.95       22.52      18.37     12.84
MSE              -          0.42        2.47        4.06        0.85       3.00      13.90
CRR              93.18%     90.91%      90.91%      90.91%      55.60%     58.30%    54.20%

on the performance of gesture recognition. In this experiment, we focus on gesture recognition in a presentation scenario in which four gestures are defined to control the screen content switch: up, down, go and back, as shown in Fig. 17. Gesture recognition is performed on the decoded depth map with a motion history image (MHI) [32] based recognition algorithm. We capture nine sequences including all four gestures. Gesture recognition is performed on the original depth map, the reconstructed depth map from HEVC and the reconstructed depth map from the proposed scheme, respectively, using the same recognition implementation. The average compression performance and the correct recognition rate (CRR) are shown in Table VII. The results show that the coding performance of the proposed scheme outperforms that of HEVC for all the sequences used in the presentation. Moreover, the CRR based on the depth data compressed by the proposed scheme is much higher than that of HEVC and close to the result based on the original depth data. This is because the error control algorithm is adopted in the proposed scheme: the MSBs layer compression is lossless, while distortion may only be introduced in the LSBs layer within a limited range. Fig. 18 shows one intermediate processing result in gesture recognition. We can see that there exist artifacts around the boundary of the person and in the background of the screen in the depth map from the HEVC reconstruction, which lead to failures in hand tracking. Notice that the reconstructed depth map from the proposed scheme provides a clear boundary between the person and the screen background, which benefits gesture recognition.

VI. CONCLUSION

In this paper, we have presented a layered compression scheme for high precision depth data. There are three main advantages to the proposed scheme. First, we utilize a simple partition method at the bit-depth level to enable an extended encoding range for state-of-the-art 8-bit image/video codecs for depth data with higher bit depths. Second, an error controllable coding scheme has been designed to better preserve depth features for relevant applications. Third, the proposed coding scheme can achieve real-time encoding and decoding rates with a low bitrate cost and can easily be integrated into a high precision depth data transmission system.

REFERENCES

[1] S. B. Gokturk, H. Yalcin, and C. Bamji, "A time-of-flight depth sensor-system description, issues and solutions," in Computer Vision and Pattern Recognition Workshop, 2004. CVPRW'04. Conference on, pp. 35–35, IEEE, 2004.
[2] T. Kanade and M. Okutomi, "A stereo matching algorithm with an adaptive window: Theory and experiment," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 16, no. 9, pp. 920–932, 1994.
[3] http://research.microsoft.com/en-us/um/redmond/projects/Kinect.
[4] S. Grewatsch and E. Muller, "Evaluation of motion compensation and coding strategies for compression of depth map sequences," in Optical Science and Technology, the SPIE 49th Annual Meeting, pp. 117–124, International Society for Optics and Photonics, 2004.
[5] Y. Morvan, D. Farin, and P. H. De With, "Depth-image compression based on an R-D optimized quadtree decomposition for the transmission of multiview images," in Image Processing, 2007. ICIP 2007. IEEE International Conference on, vol. 5, pp. V-105, IEEE, 2007.
[6] S. Milani and G. Calvagno, "A depth image coder based on progressive silhouettes," Signal Processing Letters, IEEE, vol. 17, no. 8, pp. 711–714, 2010.
[7] G. Shen, W.-S. Kim, S. K. Narang, A. Ortega, J. Lee, and H. Wey, "Edge-adaptive transforms for efficient depth map coding," in Picture Coding Symposium (PCS), 2010, pp. 566–569, IEEE, 2010.
[8] A. Sánchez, G. Shen, and A. Ortega, "Edge-preserving depth-map coding using graph-based wavelets," in Signals, Systems and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on, pp. 578–582, IEEE, 2009.
[9] M. Maitre and M. N. Do, "Depth and depth–color coding using shape-adaptive wavelets," Journal of Visual Communication and Image Representation, vol. 21, no. 5, pp. 513–522, 2010.
[10] S. Mehrotra, Z. Zhang, Q. Cai, C. Zhang, P. Chou, et al., "Low-complexity, near-lossless coding of depth maps from Kinect-like depth cameras," in Multimedia Signal Processing (MMSP), 2011 IEEE 13th International Workshop on, pp. 1–6, IEEE, 2011.
[11] C. Wang, S. Chan, C. Ho, A. Liu, and H. Shum, "A real-time image-based rendering and compression system with Kinect depth camera," in Digital Signal Processing (DSP), 2014 19th International Conference on, pp. 626–630, IEEE, 2014.
[12] J. Fu, D. Miao, W. Yu, S. Wang, Y. Lu, and S. Li, "Kinect-like depth data compression," Multimedia, IEEE Transactions on, vol. 15, no. 6, pp. 1340–1352, 2013.
[13] G. Ward and M. Simmons, "Subband encoding of high dynamic range imagery," in Proceedings of the 1st Symposium on Applied Perception in Graphics and Visualization, pp. 83–90, ACM, 2004.
[14] Z. Mai, H. Mansour, P. Nasiopoulos, and R. K. Ward, "Visually favorable tone-mapping with high compression performance in bit-depth scalable video coding," Multimedia, IEEE Transactions on, vol. 15, no. 7, pp. 1503–1518, 2013.
[15] G. Ward, "Real pixels," Graphics Gems II, pp. 80–83, 1991.
[16] R. Bogart, F. Kainz, and D. Hess, "OpenEXR image file format," ACM SIGGRAPH 2003, Sketches & Applications, 2003.
[17] G. W. Larson, "LogLuv encoding for full-gamut, high-dynamic range images," Journal of Graphics Tools, vol. 3, no. 1, pp. 15–31, 1998.
[18] A. Gelman, P. L. Dragotti, and V. Velisavljević, "Multiview image coding using depth layers and an optimized bit allocation," Image Processing, IEEE Transactions on, vol. 21, no. 9, pp. 4092–4105, 2012.
[19] D. Miao, J. Fu, Y. Lu, S. Li, and C. W. Chen, "Layered compression for high dynamic range depth," in Visual Communications and Image Processing (VCIP), 2012 IEEE, pp. 1–6, IEEE, 2012.
[20] K. Khoshelham, "Accuracy analysis of Kinect depth data," in ISPRS Workshop Laser Scanning, vol. 38, p. W12, 2011.
[21] S.-W. Jung, "A bilateral hole filling algorithm for time-of-flight depth camera," in IS&T/SPIE Electronic Imaging, International Society for Optics and Photonics, 2013.
[22] R. M. Gray, "Vector quantization," ASSP Magazine, IEEE, vol. 1, no. 2, pp. 4–29, 1984.
[23] X. Wu, "Color quantization by dynamic programming and principal analysis," ACM Transactions on Graphics (TOG), vol. 11, no. 4, pp. 348–372, 1992.
[24] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Applied Statistics, pp. 100–108, 1979.
[25] T. Chiang and Y.-Q. Zhang, "A new rate control scheme using quadratic rate distortion model," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 7, no. 1, pp. 246–250, 1997.

1057-7149 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


[26] G. K. Wallace, "The JPEG still picture compression standard," Communications of the ACM, vol. 34, no. 4, pp. 30–44, 1991.
[27] x264. http://www.videolan.org/developers/x264.html.
[28] http://hevc.kw.bbc.co.uk/trac/browser/jctvc-hm/branches/HM-2.0-dev.
[29] JBIG. http://www.jpeg.org/jbig/index.html.
[30] C. Fehn, "Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV," in Electronic Imaging 2004, pp. 93–104, International Society for Optics and Photonics, 2004.
[31] MPEG-FTV Group, View Synthesis Reference Software (VSRS) 3.0.
[32] A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 23, no. 3, pp. 257–267, 2001.

Dan Miao received the B.E. degree from the University of Science and Technology of China (USTC) in 2009. He is currently working toward the Ph.D. degree in the Department of Electrical Engineering, USTC. He has been a research intern at Microsoft Research Asia. His current research interests focus on image and video coding, 3D video compression, rendering, and transmission.

Jingjing Fu received the B.Sc. degree in electrical engineering from the University of Science and Technology of China, Hefei, China, in 2005 and the Ph.D. degree in electronic and computer engineering from the Hong Kong University of Science and Technology, Kowloon, Hong Kong, in 2010. She is currently an associate researcher with the Media Computing Group, Microsoft Research Asia. Her areas of interest include image/video coding, 3-D image/video processing, and multimedia systems.

Yan Lu received his Ph.D. degree in computer science from the Harbin Institute of Technology, China, in 2003. He joined Microsoft Research Asia in 2004, where he is now a senior researcher and research manager. In recent years, he has been leading a team of researchers, developers, and designers to develop end-to-end systems around screen video coding, media communications, remote computing, virtualization, user interfaces, and hardware. From 2001 to 2004, he was a team lead at the Joint R&D Lab (JDL) for advanced computing and communication, Chinese Academy of Sciences, China. From 1999 to 2000, he was with the City University of Hong Kong as a research assistant. His research interests include image and video coding, multimedia systems, networking, remote computing, and user interfaces. He serves as an Editorial Board member of the Journal of Visual Communication and Image Representation. He holds more than 20 granted U.S. patents. In 2007, he received the IS&T/SPIE VCIP Best Paper Award.

Shipeng Li (F ’10) received the B.S. and M.S. degrees from the University of Science and Technology of China (USTC), Hefei, China, in 1988 and 1991, respectively, and the Ph.D. degree from Lehigh University, Bethlehem, PA, USA, in 1996, all in electrical engineering. He joined Microsoft Research Asia (MSRA) in May 1999. He is currently a principal researcher and research manager of the Media Computing group. He also serves as the research area manager coordinating the multimedia research activities at MSRA. From October 1996 to May 1999, he was with the Multimedia Technology Laboratory at Sarnoff Corporation (formerly David Sarnoff Research Center and RCA Laboratories) as a member of the technical staff. He has been actively involved in research and development in broad multimedia areas. He has made several major contributions adopted by the MPEG-4 and H.264 standards. He invented and developed the world's first cost-effective high-quality legacy HDTV decoder in 1998. He started P2P streaming research at MSRA as early as August 2000. He led the building of the first working scalable video streaming prototype across the Pacific Ocean in 2001. He has been an advocate of scalable coding formats and was instrumental in the SVC extension of the H.264/AVC standard. He first proposed the "Media 2.0" concepts that outlined the new directions of next-generation Internet media research in 2006. He has authored or co-authored six books/book chapters and over 280 refereed journal and conference papers and holds over 140 granted U.S. patents in image/video processing, compression and communications, digital television, multimedia, and wireless communication.

Chang Wen Chen (F ’04) is a Professor of Computer Science and Engineering at the University at Buffalo, State University of New York. He was the Allen Henry Endowed Chair Professor at the Florida Institute of Technology from July 2003 to December 2007. He was on the faculty of Electrical and Computer Engineering at the University of Rochester from 1992 to 1996 and on the faculty of Electrical and Computer Engineering at the University of Missouri-Columbia from 1996 to 2003. He has been the Editor-in-Chief of IEEE Trans. Multimedia since January 2014. He also served as the Editor-in-Chief of IEEE Trans. Circuits and Systems for Video Technology from 2006 to 2009. He has been an Editor for several other major IEEE Transactions and Journals, including the Proceedings of the IEEE, the IEEE Journal on Selected Areas in Communications, and the IEEE Journal on Emerging and Selected Topics in Circuits and Systems. He has served as Conference Chair for several major IEEE, ACM, and SPIE conferences related to multimedia, video communications, and signal processing. His research is supported by NSF, DARPA, the Air Force, NASA, the Whitaker Foundation, Microsoft, Intel, Kodak, Huawei, and Technicolor. He received his B.S. from the University of Science and Technology of China in 1983, M.S.E.E. from the University of Southern California in 1986, and Ph.D. from the University of Illinois at Urbana-Champaign in 1992. He and his students have received eight Best Paper Awards or Best Student Paper Awards over the past two decades. He has also received several research and professional achievement awards, including the Sigma Xi Excellence in Graduate Research Mentoring Award in 2003, the Alexander von Humboldt Research Award in 2009, and the State University of New York at Buffalo Exceptional Scholar Sustained Achievement Award in 2012. He is an IEEE Fellow and an SPIE Fellow.

