

Bit Allocation Algorithm With Novel View Synthesis Distortion Model for Multiview Video Plus Depth Coding

Tae-Young Chung, Member, IEEE, Jae-Young Sim, Member, IEEE, and Chang-Su Kim, Senior Member, IEEE

Abstract—An efficient bit allocation algorithm based on a novel view synthesis distortion model is proposed for the rate-distortion optimized coding of multiview video plus depth sequences in this paper. We decompose an input frame into non-edge blocks and edge blocks. For each non-edge block, we linearly approximate its texture and disparity values, and derive a view synthesis distortion model, which quantifies the impacts of the texture and depth distortions on the qualities of synthesized virtual views. On the other hand, for each edge block, we use its texture and disparity gradients for the distortion model. In addition, we formulate a bit-rate allocation problem in terms of the quantization parameters for texture and depth data. By solving the problem, we can optimally divide a limited bit budget between the texture and depth data, in order to maximize the qualities of synthesized virtual views, as well as those of encoded real views. Experimental results demonstrate that the proposed algorithm yields average PSNR gains of 1.98 and 2.04 dB in the two-view and three-view scenarios, respectively, as compared with a benchmark conventional algorithm.

Index Terms—Multi-view video plus depth, virtual view synthesis, view synthesis distortion, bit allocation, and rate-distortion optimization.

I. INTRODUCTION

THE human visual system perceives the depths of real world objects using oculomotor cues [1] and visual cues [2]. For decades, a lot of research has been performed to convey depth information using two-dimensional images. Stereoscopic systems use binocular cues to provide depth feeling (stereopsis [3]), by displaying two different images to the two eyes.

Manuscript received August 31, 2013; revised March 2, 2014 and March 23, 2014; accepted May 20, 2014. Date of publication June 2, 2014; date of current version June 23, 2014. This work was supported in part by the National Research Foundation of Korea through the Ministry of Science, ICT and Future Planning (MSIP), Korean Government, under Grant 2012-011031, in part by the Global Frontier Research and Development Program on Human-Centered Interaction for Coexistence through the MSIP, National Research Foundation (NRF), under Grant 2011-0031648, and in part by the Basic Science Research Program through the Ministry of Education, NRF, under Grant 2013R1A1A2011920. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Béatrice Pesquet-Popescu. T.-Y. Chung is with the Software Research and Development Center, Samsung Electronics Company, Ltd., Suwon 442-600, Korea (e-mail: [email protected]). J.-Y. Sim is with the School of Electrical and Computer Engineering, Ulsan National Institute of Science and Technology, Ulsan 689-798, Korea (e-mail: [email protected]). C.-S. Kim is with the School of Electrical Engineering, Korea University, Seoul 136-701, Korea (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2014.2327801

However, stereoscopic systems often require special glasses and suffer from incorrect depth perception when users move around. Auto-stereoscopic systems [4]–[6] or multi-view video (MVV) systems offer realistic depth perception, without requiring special glasses, by exploiting motion parallax cues [7] in addition to binocular cues. MVV systems use images captured from multiple viewpoints. In addition to the multiple real views, these systems can also synthesize an arbitrary intermediate view from the nearby views. Thus, they can provide a pair of stereo images, which are composed of real and/or synthesized views, at an arbitrary viewpoint. Therefore, MVV systems can be used for free-viewpoint TV [8].

An MVV system captures more than two views, and the amount of MVV data is proportional to the number of captured views. MVV data thus require a large amount of storage space or transmission bandwidth. Much effort has been made to efficiently compress MVV data [9], [10], but the compression performances are not good enough to facilitate practical applications.

To represent MVV data more compactly, a new format, called video plus depth, was proposed in the European project "Advanced Three-Dimensional Television System Technologies" (ATTEST) [11]. In this format, a single texture video and its corresponding depth video are employed together to reconstruct a pair of stereo images at an arbitrary viewpoint using the depth image-based rendering (DIBR) techniques [12]. Video plus depth can reduce the number of captured views as compared with the MVV format. However, it suffers from artifacts in reconstructed images, mainly due to occlusions, when the viewpoint of a rendered video is substantially different from that of a captured video.

To overcome this drawback and provide consistent stereo images, the multi-view video plus depth (MVD) format was also proposed [13]–[15], which consists of at least two pairs of texture and depth videos. Fig. 1 illustrates an auto-stereoscopic system based on the MVD format with three encoded views V1, V3, and V5. The texture and depth videos of views V1, V3, and V5 are encoded and transmitted. In this example, at the receiver side, two virtual views V2 and V4 are synthesized between the transmitted real views using the DIBR techniques. Thus, a user can have depth perception from an arbitrary viewpoint between the viewpoints of the real views.

MVD coding has attracted much attention. Merkle et al. [16] showed that depth videos can also be encoded using the conventional MVV coding algorithm [9], which is, however, optimized for texture videos only.



Fig. 1. An auto-stereoscopic system based on the multi-view video plus depth (MVD) format.

Note that moderate errors in depth data may cause large rendering errors in DIBR, degrading the quality of a synthesized view severely. Hence, depth video coding techniques have been developed mainly to minimize view synthesis distortions [17]–[21]. Furthermore, efficient bit allocation algorithms, considering texture and depth videos together, have been proposed [12], [22]–[26]. The effects of texture and depth data on view synthesis distortions have been empirically observed in [12] and [22] to determine a fixed ratio of bit-rates between texture and depth videos [23]. Dynamic bit allocation schemes have also been proposed [24]–[26], but they encode texture and depth videos multiple times to estimate view synthesis distortions, resulting in high computational complexity in general.

We propose a rate-distortion (R-D) optimized coding algorithm for MVD data, which estimates the distortions of intermediate virtual views without the iterative encoding and synthesis procedures. We partition an image into blocks and model texture and disparity values in each block as planar distributions. Then, we estimate the distortions of synthesized virtual views from those of encoded real texture and depth images. Based on this distortion model, we solve a bit allocation problem for MVD coding, by minimizing the distortions of the virtual views as well as those of the real views, subject to the constraint of a total bit budget. More specifically, bit-rates are first allocated to the real views and then allocated to the temporal frames in each view. Finally, the optimal bit-rates for the texture and depth images in each frame are determined. Simulation results demonstrate that the proposed algorithm outperforms the conventional MVD coding algorithms in terms of both R-D performance and computational complexity.

The remainder of the paper is organized as follows: Section II reviews previous work. Section III proposes the distortion model for virtual view synthesis. Section IV explains the bit allocation algorithm. Section V presents simulation results. Finally, concluding remarks are given in Section VI.

II. PREVIOUS WORK

This section reviews previous MVD coding techniques. We first describe depth video coding techniques, which consider the effects of depth errors only on the virtual view synthesis. Then, we survey bit allocation techniques for texture and depth video coding, in which bit-rates are allocated to texture and depth data based on view synthesis distortion models. In this section, the PSNR gain of each algorithm is measured by the Bjontegaard method [27] in comparison with the H.264/AVC standard, unless otherwise specified.

A. Depth Video Coding

Previous depth video coding techniques consider the impacts of depth errors on the qualities of synthesized virtual views. Kim et al. [17] exploit texture characteristics to encode depth data: when a texture block is skipped for encoding, the corresponding depth block is also skipped. In addition, they estimate view synthesis distortions locally, assuming that local characteristics of original textures are similar to those of the corresponding synthesized textures [18]. Their algorithms provide PSNR gains of about 0.9 dB [17] and 0.6 dB [18]. Zhang et al. [19] assume that, in the frequency domain, the relationship between depth distortions and the quality of view synthesis is similar to that between motion distortions and the quality of motion compensation. Then, they estimate view synthesis distortions by weighting depth distortions with the spectral energies of the texture data. Their algorithm achieves about a 1.8 dB PSNR gain, but is limited to the intra coding mode only. Chung et al. [20] also estimate view synthesis distortions in the frequency domain and quantize depth blocks adaptively, yielding a PSNR gain of up to 6.5 dB. However, their algorithm requires high computational complexity to encode blocks iteratively using various quantization parameters. Tech et al. [21] measure the change of a view synthesis distortion according to the change in a depth distortion. They integrate their distortion metric into the 3D extension of the High Efficiency Video Coding standard (3D-HEVC), and achieve about a 0.6 dB PSNR gain against the conventional 3D-HEVC. These conventional algorithms improve the R-D performance of depth video coding. However, they consider depth errors only to evaluate the qualities of synthesized views, even though texture errors also affect the synthesized views.

B. Bit Allocation for Texture and Depth Video Coding

Fehn [12] shows that a depth video can be encoded using only 10∼20% of the bit-rate for a texture video, due to the relative smoothness of depth data.

Recently, Bosc et al. [22] investigate how the bit-rate ratio between texture and depth data affects the qualities of synthesized views. They observe that the depth bit-rate should be about 40∼60% of the total bit-rate to maximize the view synthesis performance. Liu et al. [23] also determine a fixed bit-rate ratio between texture and depth data and develop a three-level bit allocation strategy: view level, texture and depth video level, and frame level. Since these algorithms fix the bit-rate ratios for texture and depth videos, they cannot always synthesize the highest quality virtual views for various video sequences.

Therefore, several methods have been developed to dynamically allocate the bit-rates between texture and depth data. Morvan et al. [24] encode texture and depth videos multiple times with a pre-defined set of quantization parameters and measure the distortions of virtual views, synthesized from the reconstructed texture and depth videos. Based on this actual R-D behavior, they exhaustively search for the optimal quantization parameters to minimize the view synthesis distortion subject to a given bit-rate constraint. To reduce the high computational complexity, they search for the optimal quantization parameters hierarchically. Liu et al. [25] encode texture videos multiple times to get the optimal quantization parameter for texture data and then estimate the optimal quantization parameter for depth data using a depth-induced view synthesis distortion model. Even though Morvan et al. [24] and Liu et al. [25] provide about 1 dB PSNR gains against the H.264/AVC simulcast coding with a fixed bit-rate ratio, texture or depth videos are encoded multiple times with several quantization parameters, which introduces a significant computational complexity. Yuan et al. [26] propose a linear model of texture and depth distortions to estimate view synthesis distortions. Using the rate model in [28] and the distortion model in [29], they devise an R-D function in terms of quantization parameters. Then, they employ the Lagrangian multiplier method to estimate the optimal parameters to minimize the view synthesis distortion subject to a bit-rate constraint. They provide about a 1.2 dB PSNR gain against the H.264/MVC with a fixed bit-rate ratio, while reducing the complexity. However, they still require two additional encoding operations to estimate the parameters of their view synthesis distortion model. Moreover, these algorithms [24]–[26] try to minimize the distortions of synthesized virtual views only, even though encoded real views should also be available to users, as shown in Fig. 1.

III. VIEW SYNTHESIS DISTORTION ESTIMATION

We investigate the impacts of texture distortions and depth distortions on the qualities of synthesized virtual views. Based on the analysis, we propose a model to estimate view synthesis distortions, which requires neither the iterative coding of MVD data nor the actual synthesis of virtual views.

A. Synthesis of Intermediate Virtual Views

Let us consider the synthesis of an intermediate virtual view between two transmitted real views in an MVD coding system. Suppose that a right view S0 and a left view S1 are captured by two cameras of the same focal length f in a parallel configuration, where the baseline distance between the cameras is 1, as shown in Fig. 2.

Fig. 2. The synthesis of an intermediate virtual view Sk between two real views S0 and S1. A 3D point P is projected onto pixels p and q on S0 and S1, respectively. The corresponding pixel r on a virtual view Sk is synthesized from p and q bidirectionally.

Let P = (X, Y, Z) be a 3D point in the world coordinate system, whose origin coincides with the camera center of the right view S0. Let p and q be the projected pixels of P onto S0 and S1, respectively. Also, let r be the projection of P onto a synthesized virtual view Sk, which lies between S0 and S1 and is at a distance of k from S0. When we represent the positions of p and r using the local coordinates of the corresponding image planes, their y-coordinates are identical. The x-coordinates of p and r are fX/Z and f(X + k)/Z, respectively. Thus, the disparity d(p) at pixel p is given by
$$\mathbf{d}(\mathbf{p}) = \mathbf{r} - \mathbf{p} = [\,fk/Z,\; 0\,]^T. \qquad (1)$$
Depth data are typically quantized into 8-bit values to be compressed in video coding systems, such as H.264. They are non-uniformly quantized based on the plenoptic sampling theory [30]: the depth values of near objects are quantized with small step sizes, while those of far objects with large step sizes. Specifically, the 8-bit quantized description $\hat{Z}$ of an actual depth Z at pixel p is given by
$$\hat{Z} = \mathrm{round}\left( 255 \cdot \left( \frac{1}{Z} - \frac{1}{Z_{\mathrm{far}}} \right) \Big/ \left( \frac{1}{Z_{\mathrm{near}}} - \frac{1}{Z_{\mathrm{far}}} \right) \right), \qquad (2)$$
where Znear and Zfar represent the nearest and the farthest depths. Note that $\hat{Z}$ is an integer in the range of [0, 255] and equals 0 and 255 for the farthest and the nearest points, respectively. Hence, given a pixel value $\hat{Z}$ in a quantized depth image, the disparity d(p) in (1) can be estimated as
$$\mathbf{d}(\mathbf{p}) = \left[\, fk \left( \frac{\hat{Z}}{255} \left( \frac{1}{Z_{\mathrm{near}}} - \frac{1}{Z_{\mathrm{far}}} \right) + \frac{1}{Z_{\mathrm{far}}} \right),\; 0 \,\right]^T. \qquad (3)$$
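As a concrete illustration of (2) and (3), the following Python sketch maps actual depths to 8-bit depth values and back to horizontal disparities. The function and variable names, and the camera parameters in the example, are our own assumptions rather than part of any reference implementation.

import numpy as np

def quantize_depth(Z, Z_near, Z_far):
    # 8-bit non-uniform depth quantization of (2): near depths get finer steps.
    Z_hat = 255.0 * ((1.0 / Z - 1.0 / Z_far) / (1.0 / Z_near - 1.0 / Z_far))
    return np.round(Z_hat).astype(np.uint8)

def disparity_from_depth(Z_hat, f, k, Z_near, Z_far):
    # Horizontal disparity of (3) for a virtual view at distance k from S0
    # (baseline normalized to 1); the vertical component is zero.
    inv_Z = (Z_hat / 255.0) * (1.0 / Z_near - 1.0 / Z_far) + 1.0 / Z_far
    return f * k * inv_Z

# Example with hypothetical camera parameters.
Z = np.array([[2.0, 4.0], [8.0, 16.0]])               # actual depths
Z_hat = quantize_depth(Z, Z_near=2.0, Z_far=20.0)
d_h = disparity_from_depth(Z_hat, f=1000.0, k=0.5, Z_near=2.0, Z_far=20.0)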


Based on the disparity, the virtual view Sk can be synthesized from the right view S0. Let S0→k denote the virtual view synthesized from S0; then we have
$$S_{0\to k}(\mathbf{r}) = S_{0\to k}(\mathbf{p} + \mathbf{d}(\mathbf{p})) = S_0(\mathbf{p}) \qquad (4)$$
where pixel r on Sk matches pixel p on S0. Similarly, let d(q) be the disparity at q from S1 to Sk, estimated from the quantized depth value at q using (3). Then, Sk can be synthesized from S1 as
$$S_{1\to k}(\mathbf{r}) = S_{1\to k}(\mathbf{q} - \mathbf{d}(\mathbf{q})) = S_1(\mathbf{q}). \qquad (5)$$
Due to occlusion, S0→k or S1→k may have empty pixels, which are not mapped from any pixels in S0 or S1. Those pixels are interpolated from neighboring pixels, based on the geometric proximity or by employing the inpainting method [31]. To render intermediate virtual views smoothly, Sk is then reconstructed by the weighted superposition of S0→k and S1→k, given by
$$S_k(\mathbf{r}) = (1-k)\, S_{0\to k}(\mathbf{r}) + k\, S_{1\to k}(\mathbf{r}). \qquad (6)$$
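The warping in (4) and (5) and the blending in (6) can be sketched as follows. This is a simplified illustration under the parallel-camera assumption, with hole filling reduced to copying from the other warped view instead of the inpainting of [31]; all names are assumptions made for the example.

import numpy as np

def warp_to_virtual(S, d_h, direction=+1):
    # Forward-warp texture S to the virtual view using per-pixel horizontal
    # disparities d_h: direction=+1 realizes (4) for the right view S0,
    # direction=-1 realizes (5) for the left view S1.
    # Pixels never hit by the warp remain NaN holes.
    H, W = S.shape
    out = np.full((H, W), np.nan)
    for y in range(H):
        for x in range(W):
            xv = int(round(x + direction * d_h[y, x]))
            if 0 <= xv < W:
                out[y, xv] = S[y, x]
    return out

def blend_views(S0k, S1k, k):
    # Weighted superposition of (6); a hole in one warped view is filled from
    # the other (pixels missing in both stay NaN and would need inpainting).
    Sk = (1.0 - k) * S0k + k * S1k
    return np.where(np.isnan(Sk), np.where(np.isnan(S0k), S1k, S0k), Sk)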

Let r = (x, y), and let Ds,k denote the mean squared distortion of Sk, which is defined as
$$D_{s,k} = \frac{1}{|S_k|}\int_{(x,y)\in S_k} \left\{ S_k(x,y) - \tilde{S}_k(x,y) \right\}^2 dx\,dy \qquad (7)$$
where |Sk| is the area of Sk, and $\tilde{S}_k$ denotes the reconstructed version of Sk. We approximate Ds,k as
$$D_{s,k} = (1-k)^2 D_{s,0\to k} + 2k(1-k)\, D_{s,\mathrm{joint}} + k^2 D_{s,1\to k} \qquad (8)$$
$$\approx (1-k)\, D_{s,0\to k} + k\, D_{s,1\to k} \qquad (9)$$
where
$$D_{s,0\to k} = \frac{1}{|S_k|}\int \left\{ S_{0\to k}(x,y) - \tilde{S}_{0\to k}(x,y) \right\}^2 dx\,dy, \qquad (10)$$
$$D_{s,1\to k} = \frac{1}{|S_k|}\int \left\{ S_{1\to k}(x,y) - \tilde{S}_{1\to k}(x,y) \right\}^2 dx\,dy, \qquad (11)$$
$$D_{s,\mathrm{joint}} = \frac{1}{|S_k|}\int \left\{ S_{0\to k}(x,y) - \tilde{S}_{0\to k}(x,y) \right\}\left\{ S_{1\to k}(x,y) - \tilde{S}_{1\to k}(x,y) \right\} dx\,dy. \qquad (12)$$

The joint term Ds,joint is nonzero in general, and tends to be positive since the errors $\{S_{0\to k}(x,y) - \tilde{S}_{0\to k}(x,y)\}$ and $\{S_{1\to k}(x,y) - \tilde{S}_{1\to k}(x,y)\}$ are correlated. However, if we considered the joint term, it would make the proposed view synthesis distortion model too complicated to be used in applications. Thus, in (9), we assume that the joint term is zero. Instead, we approximately use the weights (1 − k) and k for Ds,0→k and Ds,1→k, respectively, which are larger than the original weights (1 − k)² and k².

B. Block Partitioning

To estimate the distortion of a synthesized view, we decompose an input depth image into blocks and compute the distortion associated with each block first. Specifically, we classify blocks into two types: non-edge blocks and edge blocks. Pixels in a non-edge block have similar disparities, whereas an edge block contains abruptly changing disparities.

Fig. 3. Block partitioning results on (a) a texture image and (b) the corresponding disparity map in the “Cafe” sequence. Non-edge blocks and edge blocks are represented by white and green squares, respectively.

We convert an input depth image to the disparity map via (3), and compute the gradient magnitude Gd(p) of disparity at pixel p along the horizontal direction,
$$G_d(\mathbf{p}) = |d_h(\mathbf{p}) - d_h(\mathbf{p}_{\mathrm{right}})| \qquad (13)$$

where dh(p) and dh(pright) are the horizontal disparities at pixel p and its right neighbor pright. We declare p as an edge pixel if Gd(p) is larger than a threshold δd, and a non-edge pixel otherwise. Then, we regard each 2 × 2 block as an edge block if it contains strictly more edge pixels than non-edge pixels, and a non-edge block otherwise. If four adjacent 2 × 2 blocks have the same block type and their average texture values are similar to one another within a threshold δt, we merge them into a larger 4 × 4 block. This merging is iterated until the size of a merged block becomes 16 × 16. We set δd = 1 and δt = 20 empirically. In (13), we use disparity variations along the horizontal direction only to determine block types, since the cameras are assumed to be horizontally displaced and in a parallel configuration as shown in Fig. 2. In such a case, pixels in different rows are not related in the view synthesis, and disparity variations along the vertical direction do not affect view synthesis distortions directly.

Fig. 3 shows an example of the block partitioning on the "Cafe" sequence. Figs. 3(a) and (b) depict partitioned blocks in a texture image and the corresponding disparity map, in which green and white squares, respectively, represent edge and non-edge blocks. Most non-edge blocks occur inside objects or the background. On the other hand, edge blocks are distributed mainly around object boundaries, which exhibit varying texture values and abruptly changing disparities.

Non-edge and edge blocks have different impacts on the view synthesis. Fig. 4(a) shows a texture image S0 and its corresponding disparity map, which are used to synthesize a virtual view via (4). In the real view S0, we mark three non-edge blocks A1, A2, and A3. In the virtual view S0→k, there are the corresponding blocks A′1, A′2, and A′3. In Fig. 4(b), A1 has almost identical disparities, and thus A′1 is synthesized without structural distortions and has almost the same block size as A1. In Figs. 4(c) and (d), A2 and A3 have slowly varying disparities along the horizontal direction. Therefore, the reconstructed blocks A′2 and A′3 have similar contents to the original blocks, even though the sizes are changed. On the other hand, we mark two edge blocks B1 and B2 in Fig. 4(a), which contain depth discontinuities. In Figs. 4(e) and (f), B1 and B2 have abruptly increasing and decreasing disparities, respectively, which make the synthesized blocks B′1 and B′2 much wider or narrower than the original blocks. To be more specific, due to the increasing disparities in B1, many pixels in B′1 do not have matching pixels in B1, which are hard to reconstruct correctly using a simple interpolation technique. On the contrary, due to the decreasing disparities in B2, some pixels in B2 are occluded in B′2 and only parts of the original contents are reconstructed in B′2. This implies that large disparity variations in a block make it difficult to faithfully synthesize the corresponding block in a virtual view, and the distortions in such an edge block may have more adverse impacts on the view synthesis than those in a non-edge block. Therefore, we estimate the view synthesis distortion by deriving different distortion models for non-edge blocks and edge blocks separately.
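A rough sketch of the block partitioning described above is given below. The 2 × 2 classification follows (13) with δd = 1, and the quadtree-style merging uses δt = 20; the test that four average texture values are "similar within δt" is simplified here to a max-min check, which is our assumption, as are all function names.

import numpy as np

def partition_blocks(texture, disparity, delta_d=1.0, delta_t=20.0, max_size=16):
    # Classify 2x2 blocks as edge/non-edge from horizontal disparity gradients (13),
    # then repeatedly merge groups of four same-type blocks with similar mean texture.
    H, W = disparity.shape
    G_d = np.abs(disparity[:, :-1] - disparity[:, 1:])
    G_d = np.pad(G_d, ((0, 0), (0, 1)))          # zero gradient at the last column
    edge_pixel = G_d > delta_d

    blocks = {}                                   # {(top, left, size): is_edge_block}
    for y in range(0, H - 1, 2):
        for x in range(0, W - 1, 2):
            patch = edge_pixel[y:y + 2, x:x + 2]
            blocks[(y, x, 2)] = patch.sum() > patch.size - patch.sum()

    size = 2
    while size < max_size:
        merged = {}
        for (y, x, s), _ in list(blocks.items()):
            if s != size or y % (2 * size) or x % (2 * size):
                continue                          # only anchors of aligned quads
            quad = [(y, x), (y, x + size), (y + size, x), (y + size, x + size)]
            if not all((qy, qx, size) in blocks for qy, qx in quad):
                continue
            types = [blocks[(qy, qx, size)] for qy, qx in quad]
            means = [texture[qy:qy + size, qx:qx + size].mean() for qy, qx in quad]
            if len(set(types)) == 1 and max(means) - min(means) <= delta_t:
                for qy, qx in quad:
                    del blocks[(qy, qx, size)]
                merged[(y, x, 2 * size)] = types[0]
        blocks.update(merged)
        size *= 2
    return blocks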

Fig. 4. Virtual view synthesis on the "UndoDancer" sequence. (a) The right view S0 and its disparity map are used to reconstruct a virtual view S0→k. Three non-edge blocks A1, A2, A3 are marked in the real view S0, and the corresponding blocks A′1, A′2, A′3 are synthesized in the virtual view S0→k, as illustrated in (b), (c), (d). Also, two edge blocks B1 and B2 are marked, and the corresponding blocks B′1 and B′2 are synthesized in (e) and (f).

C. Distortion Models for View Synthesis

First, we describe the view synthesis distortion model for non-edge blocks. Let Bs,0 denote a non-edge block in the real texture image S0, and Bz,0 and Bd,0 be the corresponding depth block and disparity block, respectively. Also, Bs,0→k denotes the block in the virtual texture image S0→k, which is synthesized using Bs,0 and Bd,0. Fig. 5 illustrates the relation between a pixel p = [x, y]T in Bs,0 and the corresponding pixel r = [x′, y′]T in Bs,0→k. From (1), the disparity estimated from S0 to Sk is given by d(p) = r − p. Also, the vertical disparity component is zero due to the assumption of the parallel camera configuration.

Fig. 5. Disparity relation between a non-edge block Bs,0 in a real view and its corresponding block Bs,0→k in a virtual view.

Therefore, the local coordinates of p and r are related by
$$x' = x + d_h(x, y), \qquad (14)$$
$$y' = y, \qquad (15)$$
where dh(x, y) is the horizontal disparity component at p. We assume that dh(x, y) is an affine function of x and y within Bd,0. This assumption is reasonable, since the block partitioning scheme in Section III-B divides an image into blocks of various sizes from 2 × 2 to 16 × 16 so that each non-edge block has smoothly varying disparities. Specifically, we model dh(x, y) using a planar equation, given by
$$d_h(x, y) = \alpha x + \beta y + \gamma, \qquad (16)$$

where the plane parameters α, β, and γ are estimated for each block Bd,0 using the least squares method. In other words, we determine α, β, and γ that minimize
$$\left\| \begin{bmatrix} d_h(\mathbf{p}_1) \\ d_h(\mathbf{p}_2) \\ \vdots \\ d_h(\mathbf{p}_K) \end{bmatrix} - \begin{bmatrix} \mathbf{p}_1^T & 1 \\ \mathbf{p}_2^T & 1 \\ \vdots & \vdots \\ \mathbf{p}_K^T & 1 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \\ \gamma \end{bmatrix} \right\|^2 \qquad (17)$$

where K is the number of pixels in Bd,0 and dh(pi) is the horizontal disparity at the ith pixel pi. Then, from (14) and (16), we obtain
$$x' = (1+\alpha)x + \beta y + \gamma. \qquad (18)$$
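The least-squares plane fit of (16)-(17) can be sketched as follows; the helper names are ours, and the same routine can be reused for the texture model introduced next.

import numpy as np

def fit_plane(values, coords):
    # Least-squares fit of an affine model v = a*x + b*y + c over a block,
    # as in (16) and (17); coords is a (K, 2) array of (x, y) pixel positions.
    A = np.column_stack([coords, np.ones(len(coords))])   # rows [x_i, y_i, 1]
    params, *_ = np.linalg.lstsq(A, values, rcond=None)
    return params                                           # (a, b, c)

def fit_disparity_plane(disp_block, x0=0, y0=0):
    # Fit (alpha, beta, gamma) to the horizontal disparities of a non-edge block
    # whose top-left pixel sits at (x0, y0).
    h, w = disp_block.shape
    ys, xs = np.mgrid[y0:y0 + h, x0:x0 + w]
    coords = np.column_stack([xs.ravel(), ys.ravel()])
    return fit_plane(disp_block.ravel(), coords)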

Similarly, the texture values in a non-edge block Bs,0 tend to vary smoothly and exhibit high spatial correlations in general, since non-edge blocks are hierarchically merged together according to their similarities in the texture image in Section III-B. Therefore, we also model the texture value Bs,0(x, y) in an affine manner as
$$B_{s,0}(x, y) = \zeta x + \eta y + \xi, \qquad (19)$$

where the parameters ζ, η, and ξ are determined using the least squares method as well. The view synthesis equation in (4) can be rewritten, in terms of the two blocks Bs,0 and Bs,0→k, as
$$B_{s,0\to k}(x', y') = B_{s,0}(x, y). \qquad (20)$$

From (15), (18), and (19), we have
$$B_{s,0\to k}(x', y') = \frac{\zeta}{1+\alpha}\,(x' - \beta y' - \gamma) + \eta y' + \xi. \qquad (21)$$
In most applications, the virtual block Bs,0→k is synthesized from the reconstructed versions of the texture block Bs,0


and the disparity block Bd,0, rather than from their original versions. The reconstructed versions are distorted due to compression errors. Let $\tilde{B}_{s,0}(x, y)$ and $\tilde{d}_h(x, y)$ denote the distorted texture value and the distorted horizontal disparity, respectively. The distortions in $\tilde{B}_{s,0}(x, y)$ and $\tilde{d}_h(x, y)$ affect the view synthesis. To derive a simple closed-form distortion model for the synthesized virtual block, we assume that the distortions in $\tilde{B}_{s,0}(x, y)$ and $\tilde{d}_h(x, y)$ are associated only with the constant terms, γ in (16) and ξ in (19), and that the other parameters α, β, ζ, and η are error-free. If we considered the general case in which they are also erroneous, the model would become too complicated to be used in practical applications. Also, α, β, ζ, and η are determined by many AC coefficients through the cosine transform. Therefore, errors in α, β, ζ, and η tend to be negligible, since they are affected by various error sources, i.e., the quantization errors in many AC coefficients. Under this assumption, the synthesized texture value is distorted as
$$\tilde{B}_{s,0\to k}(x', y') = \frac{\zeta}{1+\alpha}\,(x' - \beta y' - \tilde{\gamma}) + \eta y' + \tilde{\xi} \qquad (22)$$

where $\tilde{\gamma}$ and $\tilde{\xi}$ denote the distorted γ and ξ. Then, we compute the overall distortion $D^b_{s,0\to k}$ of the synthesized virtual block Bs,0→k as
$$D^b_{s,0\to k} = \frac{1}{|B_{s,0\to k}|}\int_{(x',y')\in B_{s,0\to k}} \left( B_{s,0\to k}(x',y') - \tilde{B}_{s,0\to k}(x',y') \right)^2 dx'\,dy' \qquad (23)$$
$$= \frac{1}{|B_{s,0\to k}|}\int_{(x',y')\in B_{s,0\to k}} \left( (\xi - \tilde{\xi}) + \frac{\zeta}{1+\alpha}(\gamma - \tilde{\gamma}) \right)^2 dx'\,dy' \qquad (24)$$
$$= \frac{1}{|B_{s,0}|}\int_{(x,y)\in B_{s,0}} \left( (\xi - \tilde{\xi}) + \frac{\zeta}{1+\alpha}(\gamma - \tilde{\gamma}) \right)^2 (1+\alpha)\, dx\,dy \qquad (25)$$
$$= \frac{1}{|B_{s,0}|}\int_{(x,y)\in B_{s,0}} (1+\alpha)(\xi - \tilde{\xi})^2\, dx\,dy + \frac{1}{|B_{s,0}|}\int_{(x,y)\in B_{s,0}} \frac{\zeta^2}{1+\alpha}(\gamma - \tilde{\gamma})^2\, dx\,dy \qquad (26)$$
where |Bs,0→k| and |Bs,0| denote the areas of the synthesized block Bs,0→k and the original block Bs,0. We can rewrite (24) as (25), since dx′ = (1 + α) dx from (18) and dy′ = dy from (15). Also, note that $(\xi - \tilde{\xi})$ and $(\gamma - \tilde{\gamma})$ are the texture and disparity distortions, which are caused by the quantization of the corresponding DC coefficients. Thus, they can be assumed to be zero-mean white noise processes [32]. They can also be assumed to be uncorrelated with each other, as in [26]. Under these assumptions, the expectation of $(\xi - \tilde{\xi})(\gamma - \tilde{\gamma})$ is zero. Therefore, we approximate (25) by (26).

Let $D^b_{s,0}$ denote the distortion of the texture block Bs,0, and $D^b_{d,0}$ denote the distortion of the disparity block Bd,0. They are given by
$$D^b_{s,0} = \frac{1}{|B_{s,0}|}\int_{(x,y)\in B_{s,0}} \left( B_{s,0}(x,y) - \tilde{B}_{s,0}(x,y) \right)^2 dx\,dy = \frac{1}{|B_{s,0}|}\int_{(x,y)\in B_{s,0}} (\xi - \tilde{\xi})^2\, dx\,dy, \qquad (27)$$
$$D^b_{d,0} = \frac{1}{|B_{d,0}|}\int_{(x,y)\in B_{d,0}} \left( B_{d,0}(x,y) - \tilde{B}_{d,0}(x,y) \right)^2 dx\,dy = \frac{1}{|B_{d,0}|}\int_{(x,y)\in B_{d,0}} (\gamma - \tilde{\gamma})^2\, dx\,dy, \qquad (28)$$
where |Bd,0| denotes the area of the disparity block Bd,0, and |Bd,0| = |Bs,0|. Then, $D^b_{s,0\to k}$ in (26) can be rewritten as
$$D^b_{s,0\to k} \approx (1+\alpha)\, D^b_{s,0} + \frac{\zeta^2}{1+\alpha}\, D^b_{d,0}. \qquad (29)$$

Fig. 6. Graphical interpretation of the view synthesis distortion model, where b denotes the variation in the texture value. The texture synthesis error $(B_{s,0\to k}(x', y') - \tilde{B}_{s,0\to k}(x', y'))$ in the virtual view domain is induced by the texture error $(B_{s,0}(x, y) - \tilde{B}_{s,0}(x, y))$ and the disparity error $(d_h(x, y) - \tilde{d}_h(x, y))$ in the real view domain.

Fig. 6 shows a graphical interpretation of this view synthesis distortion model. As given in (23) and (24), the texture synthesis error $(B_{s,0\to k}(x', y') - \tilde{B}_{s,0\to k}(x', y'))$ in the virtual view is decomposed into the two terms $(\xi - \tilde{\xi})$ and $\frac{\zeta}{1+\alpha}(\gamma - \tilde{\gamma})$, which result from the texture distortion $(B_{s,0}(x, y) - \tilde{B}_{s,0}(x, y))$ and the disparity distortion $(d_h(x, y) - \tilde{d}_h(x, y))$, respectively, in the real view. From this distortion model in (29), the following observations can be made:
• The view synthesis distortion $D^b_{s,0\to k}$ is more affected by the texture distortion $D^b_{s,0}$ if the disparity increases horizontally with a larger α. This is because a larger α makes the synthesized block Bs,0→k wider than the original block Bs,0, as indicated by (18) and as illustrated in Fig. 4(d) and Fig. 6.


Fig. 7. Illustration of weighting coefficients in the view synthesis distortion models: (a) weights for the texture distortion terms and (b) weights for the disparity distortion terms. The texture image and disparity map in Fig. 3 are used in this example.

On the contrary, a larger α lessens the impact of the disparity distortion $D^b_{d,0}$ on the view synthesis distortion $D^b_{s,0\to k}$, since the disparity error $(\gamma - \tilde{\gamma})$ is scaled by $\frac{\zeta}{1+\alpha}$, as shown in Fig. 6, and then squared in the mean squared error (MSE) computation in (24).
• A larger ζ makes the disparity distortion $D^b_{d,0}$ contribute more to $D^b_{s,0\to k}$. For instance, when ζ = 0, the texture values are constant along the horizontal direction and the slope of the plane Bs,0 in Fig. 6 becomes zero. In such a case, the local disparity distortion does not affect the view synthesis. In other words, a disparity error is more tolerable when a block has less texture variation.

Next, we describe the view synthesis distortion model for edge blocks. As mentioned in Section III-B, edge blocks usually occur around object boundaries and have large texture variations and disparity discontinuities. Also, edge blocks often cause occlusion in synthesized virtual views. Thus, vertical edges in texture or disparity blocks incur severe view synthesis distortions in general. Therefore, we use the maximum disparity gradient as a weight for the texture distortion, and the squared maximum texture gradient for the disparity distortion. In other words, we approximate the view synthesis distortion for an edge block as
$$D^b_{s,0\to k} = \left( \max_{\mathbf{p}\in B_{d,0}} G_d(\mathbf{p}) \right) D^b_{s,0} + \left( \max_{\mathbf{p}\in B_{s,0}} G_s(\mathbf{p}) \right)^2 D^b_{d,0} \qquad (30)$$
where Gd(p) is given in (13). Similarly, Gs(p) is the horizontal texture gradient at pixel p,
$$G_s(\mathbf{p}) = |B_{s,0}(\mathbf{p}) - B_{s,0}(\mathbf{p}_{\mathrm{right}})|. \qquad (31)$$
Notice that this model for edge blocks is inspired by our analysis in (29), in which the horizontal texture gradient ζ is squared for weighting the disparity distortion but the horizontal disparity gradient α is applied directly. Similarly, in (30), $\max_{\mathbf{p}\in B_{s,0}} G_s(\mathbf{p})$ is squared, whereas $\max_{\mathbf{p}\in B_{d,0}} G_d(\mathbf{p})$ is not.

Fig. 7 shows the weighting coefficients of the proposed distortion models in (29) and (30), which are computed for the texture image and disparity map in Fig. 3. Fig. 7(a) visualizes the weights for the texture distortion terms: (1 + α) for non-edge blocks and $\max_{\mathbf{p}\in B_{d,0}} G_d(\mathbf{p})$ for edge blocks. Most non-edge blocks have weights close to 1, which means that the horizontal disparity variations are negligible, i.e., α ≈ 0. In contrast, on edge blocks, the texture distortions are multiplied by large weights to compute the view synthesis distortions. Fig. 7(b) depicts the weights for the disparity distortion terms: ζ²/(1 + α) for non-edge blocks and $(\max_{\mathbf{p}\in B_{s,0}} G_s(\mathbf{p}))^2$ for edge blocks. Most non-edge blocks have weights near 0, while edge blocks have large weights, ranging from 200 to 1900, due to large variations in texture values.

We can sum up the distortions of all blocks in an image using (29) and (30), and then represent the view synthesis distortion Ds,0→k of the entire virtual image in terms of the distortions Ds,0 and Dd,0 of the real texture image and disparity map,
$$D_{s,0\to k} = \psi_{s,0\to k}\, D_{s,0} + \psi_{z,0\to k}\, D_{d,0} \qquad (32)$$

where ψs,0→k and ψz,0→k are the resultant weighting parameters for Ds,0 and Dd,0. Then, from the relation between disparity and depth in (3), Ds,0→k can be rewritten as
$$D_{s,0\to k} = \psi_{s,0\to k}\, D_{s,0} + \psi_{z,0\to k}\, \kappa_k^2\, D_{z,0} \qquad (33)$$
where
$$\kappa_k = \frac{fk}{255}\left( \frac{1}{Z_{\mathrm{near}}} - \frac{1}{Z_{\mathrm{far}}} \right). \qquad (34)$$
In a similar manner, the other distortion Ds,1→k from the left real view can be expressed as
$$D_{s,1\to k} = \psi_{s,1\to k}\, D_{s,1} + \psi_{z,1\to k}\, \kappa_{1-k}^2\, D_{z,1}. \qquad (35)$$
Consequently, by substituting (33) and (35) into (9), we estimate the view synthesis distortion Ds,k by
$$D_{s,k} = (1-k)\left( \psi_{s,0\to k}\, D_{s,0} + \psi_{z,0\to k}\, \kappa_k^2\, D_{z,0} \right) + k\left( \psi_{s,1\to k}\, D_{s,1} + \psi_{z,1\to k}\, \kappa_{1-k}^2\, D_{z,1} \right). \qquad (36)$$
It is worthwhile to note that, using (36), we can estimate the view synthesis distortion efficiently from the distortions of the real texture and depth images, without the iterative coding of MVD data or the actual synthesis of virtual views.
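A sketch of how the block-level models (29) and (30) might be aggregated into the image-level weights of (32) is given below. The paper does not spell out the aggregation rule, so the area-weighted averaging here is an assumption of this sketch, as are the field names of the block descriptor.

def block_weights(block):
    # Per-block weights multiplying the texture and disparity distortions.
    # Non-edge blocks follow (29); edge blocks follow (30). `block` is assumed
    # to carry alpha and zeta (non-edge) or the maximum gradients (edge).
    if block["is_edge"]:
        w_tex = block["G_d_max"]                 # weight for the texture distortion
        w_dsp = block["G_s_max"] ** 2            # weight for the disparity distortion
    else:
        w_tex = 1.0 + block["alpha"]
        w_dsp = block["zeta"] ** 2 / (1.0 + block["alpha"])
    return w_tex, w_dsp

def image_level_weights(blocks):
    # Area-weighted aggregation into psi_s and psi_z of (32), assuming the
    # image-level distortions are area-weighted means of block distortions
    # (an assumption of this sketch).
    total_area = float(sum(b["area"] for b in blocks))
    psi_s = psi_z = 0.0
    for b in blocks:
        w_tex, w_dsp = block_weights(b)
        psi_s += (b["area"] / total_area) * w_tex
        psi_z += (b["area"] / total_area) * w_dsp
    return psi_s, psi_z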

IV. BIT ALLOCATION FOR MULTI-VIEW VIDEO PLUS DEPTH CODING

In this section, we propose a bit allocation algorithm for MVD sequences. We encode texture videos and depth videos by employing the 3D video coding method with the MVC configuration [33], [34], which is compatible with the MVV coding structure [9]. Given a bit budget for a frame, composed of a texture image and a depth image, we estimate the optimal bit-rates for the texture image and the depth image at the frame level using the view synthesis distortion model in (33) or (36). Note that the distortion model is not used inside the encoding loop at the block level. Also, the texture image and the depth image are encoded separately without any prediction dependency: depth data are not predicted from texture data, and texture data are not predicted from depth data.

Fig. 8 illustrates the hierarchical B prediction structures [9] for encoding two-view and three-view sequences, respectively, in which each group of pictures (GOP) consists of eight temporally adjacent frames.


TABLE I
PROPERTIES OF SIX TRAINING MVD SEQUENCES. FOR EACH SEQUENCE, THE GOP LENGTH IS 8, AND THE NUMBER OF TOTAL FRAMES IS 97. THE FRAME RATE IS MEASURED IN FRAMES PER SECOND (FPS)

Fig. 8. Hierarchical B prediction structures to encode MVD sequences: (a) two-view sequence and (b) three-view sequence.

The vertical direction represents different views, while the horizontal direction shows frames at different time instances. Let Vi denote the ith view, composed of a texture video Si and a depth video Zi. In Fig. 8(a), we first encode V0 as the base view, and then encode V1 as the prediction view by employing the disparity-compensated prediction from V0 to remove inter-view redundancies. Also, we encode each view with the motion-compensated prediction to remove intra-view redundancies. The first frames in GOPs are I0 and P0 in the base view and the prediction view, respectively, and they are called key frames. I0 is encoded with the intra-prediction, and P0 is predicted from I0 at the same time instance. The other frames are Bl's, where l denotes the hierarchical level in the prediction structure. All Bl's are temporally predicted from the preceding and subsequent frames in the increasing order of their hierarchical levels. Bl's in the prediction view are also predicted from Bl's at the same time instances in the base view. In Fig. 8(b), V1 is the base view, and V0 and V2 are the two prediction views.

Given a total bit budget, we first divide it among views through the inter-view bit allocation. Then, within each view, we assign a bit-rate to each temporal frame through the intra-view bit allocation. Finally, within a frame, we determine the optimal bit-rates for the texture image and the depth image. The proposed bit allocation algorithm can be applied to general configurations of arbitrary numbers of encoded real views and synthesized virtual views. However, for simplicity, we explain the proposed algorithm with the two-view configuration in Fig. 8(a), assuming that three virtual views are synthesized between the two real views.

A. Inter-View Bit Allocation

Let Rtotal be a total bit-rate, and Rview,i be the bit-rate allocated to the ith view Vi. Given Rtotal, we determine Rview,i empirically. We use the six training sequences in Table I, which are encoded using the 3DV-ATM reference software version 5.0 [35] with four quantization parameters, 26, 31, 36, and 41.

Fig. 9. Inter-view bit consumption ratios for the six training sequences in Table I: (a) two-view encoding and (b) three-view encoding.

When encoding two views of a training sequence as in Fig. 8(a), we use the right view and the left view as the base view V0 and the prediction view V1, respectively. When encoding three views as in Fig. 8(b), we take the center view as the base view V1, and the right and left views as the two prediction views V0 and V2, respectively. After encoding the training sequences using the four quantization parameters, we measure the average ratio ωi of the consumed bit-rate Rview,i to the total bit-rate Rtotal. Then, given an arbitrary input video and a total bit-rate Rtotal, we determine the bit-rate for Vi by
$$R_{\mathrm{view},i} = \omega_i \times R_{\mathrm{total}}. \qquad (37)$$
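A minimal sketch of the inter-view split in (37) is given below; the function name is ours, and the ratios in the usage example are the two-view values obtained from the training sequences, which are reported in the next paragraph and in Fig. 9.

def inter_view_allocation(R_total, omega):
    # Split the total bit-rate among the encoded views via (37); omega maps each
    # view index to its trained bit-consumption ratio (the ratios sum to 1).
    return {i: w * R_total for i, w in omega.items()}

# Hypothetical usage with two views.
R_view = inter_view_allocation(2000.0, {0: 0.65, 1: 0.35})   # kbps per view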

Fig. 9 shows the average bit consumption results for the training sequences, from which we obtain the inter-view bit allocation parameters: ω0 = 0.65 and ω1 = 0.35 for two-view sequences, and ω0 = 0.25, ω1 = 0.47, and ω2 = 0.28 for three-view sequences. Note that the training sequences will not be employed as test sequences in the experiments in Section V.

B. Intra-View Bit Allocation

After the inter-view bit allocation, we divide the computed Rview,i among all frames within the ith view Vi. We employ the frame-level bit allocation method in [36], which assigns a bit-rate to each frame according to its level in the hierarchical B prediction structure in Fig. 8. Specifically, the weight factor υl for the lth level is given by υl = 1.0, 0.6, 0.4, 0.2 for l = 0, 1, 2, 3, respectively. Let Vi,t denote the tth frame in the ith view. Then, we allocate a bit-rate Rframe,i,t to Vi,t by
$$R_{\mathrm{frame},i,t} = \frac{\upsilon_{l_t}}{\sum_t \upsilon_{l_t}} \times R_{\mathrm{view},i} \qquad (38)$$
where lt is the level of Vi,t.
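Equation (38) amounts to normalizing the level weights over the frames of a view, as in the sketch below; the GOP level pattern in the comment is only a hypothetical example, not taken from Fig. 8.

# Weight factors for hierarchical levels l = 0..3, as in [36].
UPSILON = [1.0, 0.6, 0.4, 0.2]

def intra_view_allocation(R_view, frame_levels):
    # Distribute a view's bit-rate over its frames via (38);
    # frame_levels[t] is the hierarchical-B level of frame t.
    weights = [UPSILON[l] for l in frame_levels]
    total = sum(weights)
    return [w / total * R_view for w in weights]

# Hypothetical level pattern for one GOP: a key frame plus hierarchical B-frames.
bits = intra_view_allocation(800.0, [0, 3, 2, 3, 1, 3, 2, 3])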


TABLE II
CODING PARAMETERS FOR THE 3DV-ATM REFERENCE SOFTWARE [35] WITH THE MVC CONFIGURATION, WHICH ARE USED IN THE EXPERIMENTS

TABLE III
PROPERTIES OF SIX MVD TEST SEQUENCES. FOR EACH SEQUENCE, THE GOP LENGTH IS 8, AND THE NUMBER OF TOTAL FRAMES IS 97. THE FRAME RATE IS MEASURED IN FRAMES PER SECOND (FPS)

TABLE IV
FOUR DIFFERENT BIT BUDGETS IN KILOBITS PER SECOND (KBPS) FOR EACH TEST SEQUENCE IN THE TWO-VIEW SCENARIO AND THE THREE-VIEW SCENARIO

Fig. 10. Comparison of the R-D curves on the test sequences in the two-view scenario: (a) “Balloons,” (b) “BookArrival,” (c) “Lovebird1,” (d) “Pantomime,” (e) “Cafe,” and (f) “PoznanStreet.”

C. Texture and Depth Bit Allocation

Finally, we distribute the allocated bit-rate Rframe,i,t of frame Vi,t to the texture image Si,t and the depth image Zi,t. For the sake of simplicity, we explain the proposed algorithm using the exemplar configuration in Fig. 8(a): two real views V0 and V1 are encoded, while three virtual texture views S1/4, S2/4, and S3/4 are synthesized. For simpler notations, we set i = 0 and omit the subscript t.

We formulate the texture and depth bit allocation as a distortion minimization problem subject to the constraint of a given bit-rate Rframe,0. We should minimize the distortions Ds,k, k ∈ Ω = {1/4, 2/4, 3/4}, of the synthesized virtual views as well as the distortion Ds,0 of the encoded real view itself. Therefore, from (9), we have the constrained optimization problem
$$\min \left\{ D_{s,0} + \sum_{k\in\Omega} (1-k)\, D_{s,0\to k} \right\} \quad \text{subject to} \quad R_{s,0} + R_{z,0} \le R_{\mathrm{frame},0} \qquad (39)$$
where Rs,0 and Rz,0 denote the bit-rates for the texture image S0 and the depth image Z0, respectively.

To solve the optimization problem, we express the rate and distortion terms in (39) using the quantization step sizes Qs and Qz for S0 and Z0. Since depth values are 8-bit quantized, the depth image Z0 is represented as a gray-scale image as well. We hence estimate Rs,0 and Rz,0, based on the linear R-Q model [28], as
$$R_{s,0}(Q_s) = \frac{\mu_s}{Q_s} + \nu_s, \qquad (40)$$
$$R_{z,0}(Q_z) = \frac{\mu_z}{Q_z} + \nu_z, \qquad (41)$$
where μs, νs, μz, and νz are the R-Q model parameters, which are updated at each key frame using the linear regression technique [37]. Ds,0 and Dz,0, which are the distortions of S0 and Z0, are estimated from Qs and Qz by employing the linear D-Q model [38], [39]:
$$D_{s,0}(Q_s) = \rho_s Q_s, \qquad (42)$$
$$D_{z,0}(Q_z) = \rho_z Q_z, \qquad (43)$$
where ρs and ρz are the D-Q model parameters and are also set using the linear regression technique [37].

Next, we represent the view synthesis distortion Ds,0→k also using Qs and Qz. From (42) and (43), Ds,0→k in (33) can be rewritten as
$$D_{s,0\to k} = \psi_{s,0\to k}\, \rho_s Q_s + \psi_{z,0\to k}\, \kappa_k^2\, \rho_z Q_z. \qquad (44)$$
Therefore, from (40) to (44), the constrained optimization problem in (39) is given by
$$\min_{Q_s, Q_z} \left\{ \rho_s Q_s + \sum_{k\in\Omega} (1-k)\left( \psi_{s,0\to k}\, \rho_s Q_s + \psi_{z,0\to k}\, \kappa_k^2\, \rho_z Q_z \right) \right\} \quad \text{subject to} \quad \mu_s Q_s^{-1} + \nu_s + \mu_z Q_z^{-1} + \nu_z \le R_{\mathrm{frame},0}. \qquad (45)$$
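The linear R-Q and D-Q parameters in (40)-(43) are updated at each key frame by linear regression [37]. The sketch below shows one generic least-squares fit under that linear assumption; it is not the specific procedure of [37], and the sample arrays are assumed to come from previously coded frames.

import numpy as np

def fit_rq_model(Q_samples, R_samples):
    # Least-squares fit of the linear R-Q model R(Q) = mu / Q + nu of (40), (41).
    A = np.column_stack([1.0 / np.asarray(Q_samples, dtype=float),
                         np.ones(len(Q_samples))])
    (mu, nu), *_ = np.linalg.lstsq(A, np.asarray(R_samples, dtype=float), rcond=None)
    return mu, nu

def fit_dq_model(Q_samples, D_samples):
    # Least-squares fit of the linear D-Q model D(Q) = rho * Q of (42), (43):
    # closed-form slope of a line through the origin.
    Q = np.asarray(Q_samples, dtype=float)
    D = np.asarray(D_samples, dtype=float)
    return float(Q @ D / (Q @ Q))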


Fig. 11. Subjective comparison of the synthesized virtual view S2/4 for the 78th frame in the "Cafe" sequence in the two-view scenario. The target bit-rate is 1,700 kbps. The top and bottom rows show the synthesized virtual texture and the squared error map for an enlarged part of the virtual texture, respectively. We normalize the squared errors by dividing them by 255². (a) Ground truth, (b) Liu et al. [25], (c) Yuan et al. [26], (d) Liu et al. [23], and (e) the proposed algorithm. The PSNRs of the synthesized virtual texture images are (b) 34.90 dB, (c) 34.61 dB, (d) 33.33 dB, and (e) 37.19 dB, respectively.

TABLE V
COMPARISON OF THE R-D PERFORMANCES IN TERMS OF THE BJONTEGAARD METRIC [27] IN THE TWO-VIEW SCENARIO. THE LIU et al.'S ALGORITHM [23] IS USED AS THE BENCHMARK. THE RESULTS FOR REAL AND VIRTUAL VIEWS ARE LISTED OUTSIDE AND INSIDE THE PARENTHESES, RESPECTIVELY. THE AVERAGE RESULTS OVER ALL REAL AND VIRTUAL VIEWS ARE REPORTED WITHIN THE SQUARE BRACKETS

TABLE VI
COMPARISON OF THE R-D PERFORMANCES IN TERMS OF THE BJONTEGAARD METRIC [27] IN THE THREE-VIEW SCENARIO. THE LIU et al.'S ALGORITHM [23] IS USED AS THE BENCHMARK. THE RESULTS FOR REAL AND VIRTUAL VIEWS ARE LISTED OUTSIDE AND INSIDE THE PARENTHESES, RESPECTIVELY. THE AVERAGE RESULTS OVER ALL REAL AND VIRTUAL VIEWS ARE REPORTED WITHIN THE SQUARE BRACKETS

We obtain the optimal solution to (45) by minimizing the Lagrangian cost function [40],
$$J(Q_s, Q_z, \lambda) = \rho_s Q_s + \sum_{k\in\Omega}(1-k)\left( \psi_{s,0\to k}\, \rho_s Q_s + \psi_{z,0\to k}\, \kappa_k^2\, \rho_z Q_z \right) + \lambda\left( \mu_s Q_s^{-1} + \nu_s + \mu_z Q_z^{-1} + \nu_z - R_{\mathrm{frame},0} \right), \qquad (46)$$
where λ is a Lagrangian multiplier. We derive the optimal step sizes, $Q_s^*$ and $Q_z^*$, by taking the partial derivatives of J(Qs, Qz, λ) with respect to Qs, Qz, and λ and setting them to zero. They are given by
$$Q_s^* = \frac{\mu_s + \sqrt{\rho_z \mu_s \mu_z \bar{\psi}_{z,0\to k} / \left(\rho_s \bar{\psi}_{s,0\to k}\right)}}{R_{\mathrm{frame},0} - \nu_s - \nu_z}, \qquad (47)$$
$$Q_z^* = \sqrt{\frac{\rho_s \mu_z \bar{\psi}_{s,0\to k}}{\rho_z \mu_s \bar{\psi}_{z,0\to k}}} \times Q_s^*, \qquad (48)$$
where
$$\bar{\psi}_{s,0\to k} = 1 + \sum_{k\in\Omega}(1-k)\,\psi_{s,0\to k}, \qquad (49)$$
$$\bar{\psi}_{z,0\to k} = \sum_{k\in\Omega}(1-k)\,\kappa_k^2\,\psi_{z,0\to k}. \qquad (50)$$
The quantization parameter QP can be computed from the quantization step size Q [41] via
$$\mathrm{QP} = \max\left(0,\; \min\left(51,\; \mathrm{round}\left(6\log_2 Q + 4\right)\right)\right). \qquad (51)$$
Therefore, from $Q_s^*$ and $Q_z^*$, we find the optimal quantization parameters $\mathrm{QP}_s^*$ and $\mathrm{QP}_z^*$ for encoding the texture image S0 and the depth image Z0, respectively.
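Putting (47)-(51) together, the texture and depth quantization parameters for a frame can be computed in closed form as sketched below. The function names are ours, and the inputs are assumed to come from the fitted R-Q/D-Q models and the aggregated weights of (49)-(50).

import math

def aggregate_weights(views):
    # psi-bar terms of (49), (50); `views` maps each k to (psi_s, psi_z, kappa_k).
    psi_bar_s = 1.0 + sum((1 - k) * ps for k, (ps, pz, kap) in views.items())
    psi_bar_z = sum((1 - k) * kap ** 2 * pz for k, (ps, pz, kap) in views.items())
    return psi_bar_s, psi_bar_z

def optimal_step_sizes(mu_s, nu_s, mu_z, nu_z, rho_s, rho_z,
                       psi_bar_s, psi_bar_z, R_frame):
    # Closed-form solution (47), (48) of the Lagrangian problem (45)-(46).
    R_eff = R_frame - nu_s - nu_z
    Q_s = (mu_s + math.sqrt(rho_z * mu_s * mu_z * psi_bar_z /
                            (rho_s * psi_bar_s))) / R_eff
    Q_z = math.sqrt(rho_s * mu_z * psi_bar_s /
                    (rho_z * mu_s * psi_bar_z)) * Q_s
    return Q_s, Q_z

def qp_from_step(Q):
    # Quantization parameter from the step size, as in (51).
    return max(0, min(51, round(6 * math.log2(Q) + 4)))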


Fig. 12. Subjective comparison of synthesized virtual texture views in the three-view scenario. From top to bottom, the first two rows show the 61st frame in S1/2 of the "Lovebird1" sequence, the next two show the 46th frame in S3/2 of the "BookArrival" sequence, and the last two show the 52nd frame in S1/2 of the "PoznanStreet" sequence, at the bit-rates of 1,300, 2,700, and 2,200 kbps, respectively. In the squared error maps, we normalize each error by dividing it by 255². (a) Ground truth, (b) Liu et al. [25], (c) Yuan et al. [26], (d) Liu et al. [23], and (e) the proposed algorithm. The PSNRs of the synthesized virtual texture images are (b) 31.65 dB, (c) 32.02 dB, (d) 32.09 dB, and (e) 33.27 dB on the "Lovebird1" sequence; (b) 34.38 dB, (c) 34.99 dB, (d) 33.35 dB, and (e) 35.72 dB on the "BookArrival" sequence; and (b) 30.59 dB, (c) 30.09 dB, (d) 29.88 dB, and (e) 31.53 dB on the "PoznanStreet" sequence.

V. SIMULATION RESULTS

We implement the proposed algorithm by employing the 3DV-ATM reference software version 5.0 [35] with the MVC configuration including depth videos. Table II summarizes the coding parameters that are used in the experiments. We evaluate the performance of the proposed algorithm on six MVD test sequences, which have been provided for the 3D video coding exploration experiments [42]: "Balloons," "BookArrival," "Lovebird1," "Pantomime," "Cafe," and "PoznanStreet." Each sequence is composed of multiple views, among which two or three views are selectively encoded in our experiments. Table III summarizes the properties of the test sequences and lists the selected views. When encoding two views, we encode the right view as the base view and the left view as the prediction view. When encoding three views, we encode the center view as the base view and the right and left views as the prediction views. Intermediate virtual views are synthesized using the view synthesis reference software 3.0 [43].

A. Rate-Distortion Performance

We evaluate the R-D performance of the proposed MVD coding algorithm in two scenarios. In the two-view scenario, the right view V0 and the left view V1 are encoded as real views, and three intermediate texture views S1/4, S2/4, and S3/4 are synthesized as virtual views. In the three-view scenario, the right view V0, the center view V1, and the left view V2 are real, and two virtual views S1/2 and S3/2 are synthesized from the two nearest real views, respectively. We compare the performance of the proposed algorithm with those of three conventional algorithms: Liu et al. [25], Yuan et al. [26], and Liu et al. [23]. In each scenario, we perform the encoding with the four different bit budgets in Table IV.

As the ground truth of a virtual view, we use the view synthesized from the texture and depth images of the original real views. Then, we measure the PSNR of a distorted virtual view, which is synthesized from reconstructed texture and depth images, by comparing it with the ground truth. Moreover, we measure the PSNR of each reconstructed real view.

Fig. 10 shows the R-D curves of the proposed algorithm and the conventional algorithms in the two-view scenario. The average PSNRs of all reconstructed real and virtual views are plotted. The proposed algorithm provides significantly higher PSNRs than the conventional algorithms at the same bit-rates. Note that [23] allocates bit-rates to texture and depth images according to the fixed ratio of 4:1. On the other hand, [25] and [26] perform the texture and depth bit allocation by estimating the distortions of synthesized virtual views, but they consider a single virtual view only and determine the model parameters empirically.


Also, none of the conventional algorithms consider the distortions of encoded real views. In contrast, the proposed algorithm more accurately estimates the distortions of synthesized views, which are mathematically decomposed into the distortions of encoded texture and depth images. In addition, the proposed algorithm considers the view synthesis distortions at multiple virtual viewpoints, as well as the distortions of real views, in the rate-distortion optimization framework. The proposed algorithm hence allocates a limited bit budget more efficiently to texture and depth images, outperforming the conventional algorithms.

Next, we employ the Bjontegaard method [27] to measure the average bit-rate and PSNR differences between the R-D curves of the proposed algorithm and the conventional algorithms. Tables V and VI provide the Bjontegaard evaluation results. Note that Table V corresponds to the R-D curves in Fig. 10. The results for the real views and the virtual views are reported separately. The Liu et al.'s algorithm [23] is used as the benchmark. On average, the proposed algorithm reduces the bit-rate by about 28.69% and 30.24%, while increasing the average PSNR by about 1.98 dB and 2.04 dB, in the two-view scenario and the three-view scenario, respectively. Also, note that the Yuan et al.'s algorithm [26] and the Liu et al.'s algorithm [25] yield worse performances than the proposed algorithm.

Fig. 11(a) shows the ground truth virtual view S2/4 between the two real views V0 and V1 for the 78th frame in the "Cafe" sequence. Figs. 11(b), (c), (d), and (e) are the distorted virtual views, which are obtained by [25], [26], [23], and the proposed algorithm at the bit-rate of 1,700 kbps in the two-view scenario, respectively. The bottom row in Fig. 11 compares the squared errors between the ground truth virtual view and the synthesized virtual views. We see that the proposed algorithm yields much smaller errors than the conventional algorithms, especially around the letters on the wall.

Fig. 12 compares the synthesized virtual views, which are reconstructed by the proposed algorithm and the conventional algorithms in the three-view scenario. The 61st frame in S1/2 of the "Lovebird1" sequence, the 46th frame in S3/2 of the "BookArrival" sequence, and the 52nd frame in S1/2 of the "PoznanStreet" sequence are reconstructed at the bit-rates of 1,300, 2,700, and 2,200 kbps, respectively. The conventional algorithms yield artifacts in the wall on the "Lovebird1" sequence, whereas the proposed algorithm reconstructs the scene structures more faithfully. On the "BookArrival" sequence, the conventional algorithms yield severe errors around the left man, but the proposed algorithm reduces the errors effectively. Similarly, on the "PoznanStreet" sequence, the proposed algorithm provides a higher quality virtual view.

B. Computational Complexity

To compare the computational complexities of the proposed algorithm and the conventional algorithms, we perform the experiments on a PC with a 3.30 GHz quad-core processor and 4 GB RAM. The Liu et al.'s algorithm [25] encodes each texture image several times and each depth image three times.


Fig. 13. Comparison of the computational times to encode the entire sequences: (a) two-view and (b) three-view scenarios.

In our implementation, [25] encodes a texture image six times. The Yuan et al.'s algorithm [26] encodes both texture and depth images twice to estimate their model parameters. On the other hand, the Liu et al.'s algorithm [23] and the proposed algorithm encode both texture and depth images only once. Figs. 13(a) and (b) compare the computational times of the proposed algorithm and the conventional algorithms in the two-view scenario and the three-view scenario, respectively. When [23] is used as a benchmark, [26], [25], and the proposed algorithm require about 2.7, 4.1, and 1.2 times longer computational times, respectively. Notice that the proposed algorithm achieves a comparable complexity to the Liu et al.'s algorithm [23], but provides significantly better R-D performance.

VI. CONCLUSIONS

We proposed a novel view synthesis distortion model and an efficient bit allocation algorithm for MVD sequences. We first modeled the impacts of the distortions of texture and depth images on the qualities of synthesized virtual views. Then, we formulated the bit-rate allocation problem by combining the linear R-Q and D-Q models with the proposed view synthesis distortion model. We determined the optimal quantization parameters for texture and depth images, in order to minimize the distortions of real and virtual views subject to a bit-rate constraint. Simulation results demonstrated that the proposed algorithm yields a significantly better R-D performance than the conventional algorithms, while requiring a comparable or lower computational complexity.

In this work, we used the MSE criterion to measure the distortions of synthesized views objectively.


It is one of our future research issues to develop a perceptual view synthesis distortion model to quantify the effects of texture errors and depth errors on the subjective qualities of synthesized views. Also, we will apply the proposed algorithm to other video codecs, such as 3D-HEVC [44].

REFERENCES

[1] M. Lambooij, W. IJsselsteijn, M. Fortuin, and I. Heynderickx, "Visual discomfort and visual fatigue of stereoscopic displays: A review," J. Imag. Sci. Technol., vol. 53, no. 3, pp. 1–14, 2009.
[2] M. J. Tovée, An Introduction to the Visual System. Cambridge, U.K.: Cambridge Univ. Press, 1996.
[3] I. P. Howard and B. J. Rogers, Binocular Vision and Stereopsis. London, U.K.: Oxford Univ. Press, 1995.
[4] K. Perlin, S. Paxia, and J. S. Kollin, "An autostereoscopic display," in Proc. 27th Annu. Conf. Comput. Graph. Interact. Techn., Jul. 2000, pp. 319–326.
[5] I. Sexton and P. Surman, "Stereoscopic and autostereoscopic display systems," IEEE Signal Process. Mag., vol. 16, no. 3, pp. 85–99, May 1999.
[6] N. A. Dodgson, "Autostereoscopic 3D displays," IEEE Comput., vol. 38, no. 8, pp. 31–36, Aug. 2005.
[7] I. P. Howard and B. J. Rogers, Seeing in Depth: Depth Perception, vol. 2. South Devon, England: I. Porteous, 2002.
[8] M. Tanimoto, "Overview of free viewpoint television," Signal Process., Image Commun., vol. 21, no. 6, pp. 454–461, Jul. 2006.
[9] P. Merkle, A. Smolić, K. Müller, and T. Wiegand, "Efficient prediction structures for multiview video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 11, pp. 1461–1473, Nov. 2007.
[10] A. Smolic et al., "Coding algorithms for 3DTV—A survey," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 11, pp. 1606–1621, Nov. 2007.
[11] A. Redert et al., "ATTEST: Advanced three-dimensional television system technologies," in Proc. 1st Int. Symp. 3DPVT, Jun. 2002, pp. 313–319.
[12] C. Fehn, "Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV," in Proc. Stereoscopic Displays Virtual Reality Syst. XI, May 2004, pp. 93–104.
[13] K. Müller, P. Merkle, and T. Wiegand, "3-D video representation using depth maps," Proc. IEEE, vol. 99, no. 4, pp. 643–656, Apr. 2010.
[14] P. Kauff et al., "Depth map creation and image-based rendering for advanced 3DTV services providing interoperability and scalability," Signal Process., Image Commun., vol. 22, no. 2, pp. 217–234, Feb. 2007.
[15] Multi-View Video Plus Depth (MVD) Format for Advanced 3D Video Systems, document Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-W100, Apr. 2007.
[16] P. Merkle, A. Smolic, K. Müller, and T. Wiegand, "Multi-view video plus depth representation and coding," in Proc. IEEE ICIP, vol. 1, Sep. 2007, pp. 201–204.
[17] W.-S. Kim, A. Ortega, P. Lai, D. Tian, and C. Gomila, "Depth map distortion analysis for view rendering and depth coding," in Proc. 16th IEEE ICIP, Nov. 2009, pp. 721–724.
[18] W.-S. Kim, A. Ortega, P. Lai, D. Tian, and C. Gomila, "Depth map coding with distortion estimation of rendered view," Proc. SPIE, vol. 7543, pp. 1–10, Jan. 2010.
[19] Q. Zhang, P. An, Y. Zhang, and Z. Zhang, "Efficient rendering distortion estimation for depth map compression," in Proc. 18th IEEE ICIP, Sep. 2011, pp. 1129–1132.
[20] T.-Y. Chung, W.-D. Jang, and C.-S. Kim, "Efficient depth video coding based on view synthesis distortion estimation," in Proc. IEEE VCIP, Nov. 2012, pp. 1–4.
[21] G. Tech, H. Schwarz, K. Müller, and T. Wiegand, "3D video coding using the synthesized view distortion change," in Proc. Picture Coding Symp., May 2012, pp. 25–28.
[22] E. Bosc, V. Jantet, M. Pressigout, L. Morin, and C. Guillemot, "Bit-rate allocation for multi-view video plus depth," in Proc. 3DTV Conf., True Vis. Capture, Transmiss. Display 3D Video, May 2011, pp. 1–4.
[23] Y. Liu et al., "A novel rate control technique for multiview video plus depth based 3D video coding," IEEE Trans. Broadcast., vol. 57, no. 2, pp. 562–571, Jun. 2011.
[24] Y. Morvan, D. Farin, and P. N. de With, "Joint depth/texture bit-allocation for multi-view video compression," in Proc. Picture Coding Symp., Nov. 2007, pp. 43–49.
[25] Y. Liu, Q. Huang, S. Ma, D. Zhao, and W. Gao, "Joint video/depth rate allocation for 3D video coding based on view synthesis distortion model," Signal Process., Image Commun., vol. 24, no. 8, pp. 666–681, Sep. 2009.
[26] H. Yuan, Y. Chang, J. Huo, F. Yang, and Z. Lu, "Model-based joint bit allocation between texture videos and depth maps for 3-D video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 4, pp. 485–497, Apr. 2011.
[27] Calculation of Average PSNR Differences Between RD Curves, document ITU-T Q6/SG16, VCEG-M33, Apr. 2001.
[28] S. Ma, W. Gao, and Y. Lu, "Rate-distortion analysis for H.264/AVC video coding and its application to rate control," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 12, pp. 1533–1544, Dec. 2005.
[29] H. Wang and S. Kwong, "Rate-distortion optimization of rate control for H.264 with adaptive initial quantization parameter determination," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 1, pp. 140–144, Jan. 2008.
[30] J.-X. Chai, X. Tong, S.-C. Chan, and H.-Y. Shum, "Plenoptic sampling," in Proc. 27th Annu. Conf. Comput. Graph. Interact. Techn., Jul. 2000, pp. 307–318.
[31] A. Telea, "An image inpainting technique based on the fast marching method," J. Graph. Tools, vol. 9, no. 1, pp. 25–36, 2004.
[32] L. Xiao, M. Johansson, H. Hindi, S. Boyd, and A. Goldsmith, "Joint optimization of communication rates and linear systems," IEEE Trans. Automat. Control, vol. 48, no. 1, pp. 148–153, Jan. 2003.
[33] 3D-AVC Test Model 4, document Joint Collaborative Team on 3D Video Coding Extension Development of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCT3V-B1003, Oct. 2012.
[34] Work Plan in 3D Standards Development, document Joint Collaborative Team on 3D Video Coding Extension Development of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCT3V-B1006, Oct. 2012.
[35] Test Model for AVC Based 3D Video Coding, document ISO/IEC JTC1/SC29/WG11, MPEG 2012/N12558, Feb. 2012.
[36] Rate Control Reorganization in the Joint Model (JM) Reference Software, document Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-W042, Apr. 2007.
[37] H.-J. Lee, T. Chiang, and Y.-Q. Zhang, "Scalable rate control for MPEG-4 video," IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 6, pp. 878–894, Sep. 2000.
[38] H. Wang and S. Kwong, "A rate-distortion optimization algorithm for rate control in H.264," in Proc. IEEE ICASSP, vol. 1, Apr. 2007, pp. 1149–1152.
[39] H. Wang and S. Kwong, "Rate-distortion optimization of rate control for H.264 with adaptive initial quantization parameter determination," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 1, pp. 140–144, Jan. 2008.
[40] H. Everett III, "Generalized Lagrange multiplier method for solving problems of optimum allocation of resources," Oper. Res., vol. 11, no. 3, pp. 399–417, 1963.
[41] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, document Joint Video Team, ITU-T Recommendation H.264 and ISO/IEC 14496-10 AVC, May 2003.
[42] Draft Call for Proposals on 3D Video Coding Technology, document ISO/IEC JTC1/SC29/WG11, MPEG2011/N11830, Feb. 2011.
[43] Reference Softwares for Depth Estimation and View Synthesis, document ISO/IEC JTC1/SC29/WG11, MPEG2008/M15377, Apr. 2008.
[44] K. Müller et al., "3D high-efficiency video coding for multi-view video and depth data," IEEE Trans. Image Process., vol. 22, no. 9, pp. 3366–3378, Sep. 2013.

Tae-Young Chung (S'08–M'14) received the B.S. and Ph.D. degrees from the School of Electrical Engineering, Korea University, Seoul, Korea, in 2006 and 2013, respectively. He is currently with the Software Research and Development Center, Samsung Electronics Company, Ltd., Suwon, Korea. His current research interests include error resilient coding, multiview video coding, stereo matching, and computer vision.


Jae-Young Sim (S’02–M’06) received the B.S. degree in electrical engineering and the M.S. and Ph.D. degrees in electrical engineering and computer science from Seoul National University, Seoul, Korea, in 1999, 2001, and 2005, respectively. From 2005 to 2009, he was a Research Staff Member with the Samsung Advanced Institute of Technology, Samsung Electronics Company, Ltd., Yongin, Korea. In 2009, he joined the School of Electrical and Computer Engineering, Ulsan National Institute of Science and Technology, Ulsan, Korea, where he is currently an Associate Professor. His research interests include image and 3D visual signal processing, multimedia data compression, and computer vision.


Chang-Su Kim (S'95–M'01–SM'05) received the Ph.D. degree in electrical engineering from Seoul National University (SNU), Seoul, Korea, with a Distinguished Dissertation Award in 2000. From 2000 to 2001, he was a Visiting Scholar with the Signal and Image Processing Institute, University of Southern California, Los Angeles, CA, USA. From 2001 to 2003, he coordinated the 3D Data Compression Group at the National Research Laboratory for 3D Visual Information Processing, SNU. From 2003 to 2005, he was an Assistant Professor with the Department of Information Engineering, Chinese University of Hong Kong, Hong Kong. In 2005, he joined the School of Electrical Engineering, Korea University, Seoul, where he is currently a Professor. His current research interests include image processing and multimedia communications. He was a recipient of the IEEK/IEEE Joint Award for Young IT Engineer of the Year in 2009. He has authored more than 210 technical papers in international journals and conferences. He is an Editorial Board Member of the Journal of Visual Communication and Image Representation and an Associate Editor of the IEEE TRANSACTIONS ON IMAGE PROCESSING.
