Camera Culture

Depth Director: A System for Adding Depth to Movies

Ben Ward ■ University of Adelaide
Sing Bing Kang and Eric P. Bennett ■ Microsoft

Depth Director is an interactive system for converting 2D footage to 3D. It integrates recent computer vision advances with specialized tools that let users accurately recreate or stylistically manipulate 3D depths.

Recent years have seen renewed interest in 3D movies, due partly to technological advances in the underlying 3D display technologies. At the same time, 3D films’ quality has improved, with depth cues serving more as a storytelling device and less as a gimmick to drive theater attendance. However, the stream of new 3D films is slow owing to the additional overhead of using multiple cameras and the expertise required to use 3D in a compelling manner (or at least not induce motion sickness or eyestrain).

One solution to this content problem, which is seen as inhibiting wider adoption of 3D technology, is to convert new and legacy 2D films into 3D. Although 3D CG animated films have accomplished this in a straightforward manner, the process is complex, time-consuming, and therefore expensive for live-action films. This is because the standard technique involves manual rotoscoping (selecting outlines of objects in each frame) followed by setting depths and hand inpainting near occluded objects. (For more on research on converting 2D films to 3D, see the related sidebar.)

The logical solution to this problem would involve using computer vision techniques to automatically synthesize 3D video from 2D video. However, computer vision algorithms assume static scenes and moving cameras, which are atypical. Also, even if the algorithms were perfectly correct, the actual 3D depths might not be what the film director desires.


Our Depth Director system addresses both these issues. It automates rotoscoping and depth extraction, while giving users high-level directorial control over the rotoscoping and depths. Users can visually emphasize or deemphasize parts of a scene, using a stroke-based interface to iteratively refine the computer vision results. Figure 1 shows original 2D input and Depth Director’s results. The system workflow comprises three stages: analysis, directing, and rendering (see Figure 2).

Analysis

Depth Director begins with automated preprocessing to aid the UI and to seed the depth values if the clip is amenable to computer vision depth-recovery techniques. To help determine which pixels belong to which region, Depth Director performs a consistent segmentation using C. Lawrence Zitnick and his colleagues’ method,1 which jointly segments the video and computes optical flow. Specifically, this is an oversegmentation to obtain a large set of segments for each frame, in which each segment is linked to its corresponding segment in the next and previous frame.

Performing operations on the level of these clusters of pixels, rather than individual pixels, significantly reduces the processing required, helping to achieve real-time interaction. While performing depth assignment on the segment level limits the assignment’s accuracy, in our experiments we found that with sufficiently fine segments, few perceptual differences exist between results obtained with pixel-level depth assignment and with segment-level depth assignment.


Related Work on 2D-to-3D Film Conversion

Current methods for converting 2D films to 3D are too manually intensive or imprecise to gain wide acceptance. For example, In-Three (www.in-three.com), which converted 26 minutes of Superman Returns to 3D for IMAX, uses (as far as we know) a largely manual process.1 In contrast, Moshe Guttman and his colleagues’ system requires user interaction in only a sparse set of frames.2 Users mark frames with scribbles indicating disparity, and the system propagates these values throughout the video. This system is minimally interactive but doesn’t allow for full directorial control over the resulting depths.

Alternatively, current techniques for fully automated 2D-to-3D conversion don’t achieve a level of control or robustness that makes them useful in a production environment. For example, in video featuring significant horizontal movement, you can achieve a crude 3D effect by time-shifting the frames presented to each eye, creating a horizontal displacement in each view. This effect has seen use in software promising real-time 2D-to-3D video conversion (for example, 3Dplus; www.stereo3d.com/3dplus_software.htm). A more sophisticated version of this technique rectifies image pairs to minimize vertical disparity.3 For video shot with a moving camera, structure from motion4 can recover camera parameters for each frame of a video sequence and recover sparse 3D structure in the form of a point cloud.

With recovered camera parameters and a sufficient range of views of a scene, automatic 3D reconstruction of a static scene might be possible. However, even for our case, which involves generating a new view with a slight deviation in position from the original camera, the depth information isn’t always reliable. Interactive video-based systems (such as VideoTrace5) let users rotoscope and reconstruct objects, but they still assume a moving camera and static scene (or objects).

References

1. A.P.V. Pernis and M.S. DeJohn, “Dimensionalization: Converting 2D Films to 3D,” Proc. SPIE (Stereoscopic Displays and Applications XIX), vol. 6803, 2008.
2. M. Guttman, L. Wolf, and D. Cohen-Or, “Semi-automatic Stereo Extraction from Video Footage,” Proc. 12th IEEE Int’l Conf. Computer Vision (ICCV 09), IEEE Press, 2009, pp. 136–142.
3. X. Li et al., “Creating Stereoscopic (3D) Video from a 2D Monocular Video Stream,” Advances in Visual Computing, LNCS 4841, Springer, 2007, pp. 258–267.
4. R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge Univ. Press, 2004.
5. A. Van Den Hengel et al., “VideoTrace: Rapid Interactive Scene Modelling from Video,” ACM Trans. Graphics, vol. 26, no. 3, 2007, article 86.

Figure 1. Sample frames converted to 3D using Depth Director. (a) Original 2D input. (b) Output (in anaglyph format; red is left and blue is right). Our system is able to effectively convert dynamic scenes.

Our experiments show that the target segment size should be fine enough to preserve important structural information (that is, segments shouldn’t contain pixels belonging to two distinct objects). Segment boundaries should also be consistent between frames, so that depth discontinuities occur at corresponding pixels from frame to frame. Where segment boundaries are noticeable, we can hide them with a pixel-level smoothing operation, as we describe later.
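To make this representation concrete, here is a minimal sketch (ours, not the authors’ data structure; all names are illustrative) of a temporally linked segment in Python. Storing one depth value, the spatial neighbors, and the links to corresponding segments in adjacent frames is what lets later operations (selection, depth assignment, propagation) run per segment rather than per pixel:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Segment:
    """One oversegmentation cluster in a single frame (illustrative only)."""
    frame: int                                   # frame index within the clip
    pixels: List[Tuple[int, int]]                # (row, col) coordinates in this frame
    mean_color: Tuple[float, float, float]       # used later for the color models
    depth: Optional[float] = None                # one depth value per segment
    neighbor_ids: List[int] = field(default_factory=list)  # spatial neighbors, same frame
    prev_id: Optional[int] = None                # corresponding segment in the previous frame
    next_id: Optional[int] = None                # corresponding segment in the next frame
```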

Structure from motion (SFM) analysis can be an optional preprocessing step for scenes with a moving camera. We extract features in each frame of a sequence and find matches between features in multiple frames, giving the motion of points in the scene over the sequence. With sufficient camera movement and a largely static scene, we can reliably reconstruct both the camera’s position over the sequence and the feature points’ 3D locations, giving camera parameters for each frame and a sparse 3D point cloud of features. The resulting feature points have a 3D position, color, and a list of correspondences giving frames in which each point is visible. We can use the features and cameras recovered by SFM to partially automate 3D video generation, using a technique based on a Markov random field (MRF) to propagate depths from sparse SFM data to all the pixels in the video.


Figure 2. Depth Director system workflow. The system comprises three stages: analysis (structure from motion and consistent segmentation of the 2D input video), directing (creating or modifying regions, assigning region depths, applying depth templates, smoothing, flattening, rounding, and setting rendering parameters, under user control and guidance), and rendering (view generation and matting to produce the left-eye and right-eye video).

Directing

This stage is Depth Director’s core; it gives the user the tools necessary to specify how to convert 2D video into 3D. Our approach uses two displays (see Figure 3a). The primary display shows the UI and the source video (in the original 2D, a filmstrip view, and a perspective view). The other display is a 3D display providing interactive views of the video as the film audience would experience them. Our prototype system uses a thin-film-transistor active-matrix LCD (Hyundai P240W) display that requires polarized glasses, but it also supports anaglyphs on a traditional display.

The UI (see Figure 3b) aims to focus on the user’s high-level tasks and to abstract away the underlying algorithms. In this manner, this stage is about crafting how a shot should look by directly manipulating both the scene’s constituent objects and the camera itself.

Figure 3. Depth Director. (a) The two displays, with the left display in 2D and the right in 3D for immediate feedback. (b) The UI. The left-most column contains the controls for adding depth. The left view is the frame being edited; the user can scroll along time using the controls below it. Depth Director renders the right view at a user-specified viewpoint to visualize the frame’s depth distribution.

Object-Level Control

The user’s goal is to manipulate the individual depths of each visual element in a scene across all the frames of the clip. This involves first specifying what constitutes each object (that is, a region comprising many smaller segments) and then manipulating those objects’ depth values.

Creating regions. Users can select a group of segments forming a connected region by roughly marking the region’s interior through strokes (see Figure 4). The system selects segments containing any marked pixels. The initial coarse selection initiates a graph cut segmentation based on Carsten Rother and his colleagues’ method.2


Graph cut techniques are a common solution for image and video segmentation problems, performing segmentation based on user-supplied samples of foreground and background pixels. These techniques allow accurate segmentations from minimal user interaction, significantly easing what could be an extremely time-consuming process.

Depth Director constructs a foreground Gaussian mixture model (GMM) for the marked segments’ colors and constructs a background GMM for unmarked segments in a bounding box around the marked segments. We use models with five components. A graph cut labels segments as foreground or background on the basis of these color models and the segments’ similarity to neighboring segments. The system assigns a node in the graph to each segment in a frame, with edges to the nodes for neighboring segments. The graph cut minimizes the energy function

$$E(\alpha, \mathbf{x}) = \lambda U(\alpha, \mathbf{x}) + V(\alpha, \mathbf{x}),$$

where α is the set of node labels and x is the image color data. V is a data term encouraging labels that fit the color models. U is a smoothness term encouraging the cut to follow edges in the image by penalizing the Euclidean color space distance between pixels with different labels. λ is a constant weighting between the two terms, set to 50. We define U as

$$U(\alpha, \mathbf{x}) = \sum_{(p,q)\in M} \delta(\alpha_p, \alpha_q) \sum_{(a,b)\in D} \exp\left(-\frac{\|\mathbf{x}_a - \mathbf{x}_b\|^2}{2\beta}\right),$$

where M is the set of all neighboring nodes, D is the set of neighboring pixels in nodes p and q, β is the expected value of ‖x_a − x_b‖² (computed over all neighboring pixels in the image), and

$$\delta(\alpha_p, \alpha_q) = \begin{cases} 1 & \text{if } \alpha_p \neq \alpha_q \\ 0 & \text{otherwise.} \end{cases}$$

We define V as

$$V(\alpha, \mathbf{x}) = \sum_{n \in N}\left(1 - \sum_{i=1}^{5} \frac{w_i}{\det \Sigma_i} \exp\left(-\frac{(\mathbf{x}_n - \mu_i)^{T}\, \Sigma_i^{-1}\, (\mathbf{x}_n - \mu_i)}{2}\right)\right),$$

Figure 4. Region selection. (a) The original image. (b) The user draws scribbles on the selected object. (c) Depth Director immediately highlights the object.

where N is the set of nodes in the graph and w_i, μ_i, Σ_i, and det Σ_i are the weight, mean, covariance, and determinant of the covariance for the ith component of the appropriate GMM.

After the initial segmentation, Depth Director updates the color models with the new labelings, and the segmentation is iteratively optimized until convergence. Marked segments receive heavy weights in the graph to enforce the user-supplied labeling.

Depth Director segments the clusters generated during preprocessing. For the data term, it uses the mean cluster color. For the smoothness term, it sums gradient values along the boundaries of neighboring segments during preprocessing. Operating on clusters rather than pixels significantly reduces the graph’s size, allowing computation of the whole sequence’s segmentation at interactive speed.

Depth Director uses the frame-to-frame links generated during preprocessing to propagate cut results between frames. It propagates labels to the next frame in the sequence. The system then uses these labels to initialize the graph cut in the next frame. It doesn’t propagate the information to segments that don’t fit the foreground/background model—that is, segments for which the GMMs for the current frame give a greater likelihood for the alternate labeling.

Users can further refine the segmentation by using positive and negative strokes to add foreground and background samples. In particular, refinement might be required for selecting objects moving past a variable background, where the background color distribution in one frame doesn’t necessarily predict the distribution in the subsequent frame. We chose a stroke-based interface over other options, such as selection by dragging a box around a region, for greater flexibility, particularly in selecting thin regions that aren’t aligned with the frame.
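As an illustration of these color models (a sketch of ours, not the authors’ code; it assumes scikit-learn and NumPy are available and that each segment is summarized by its mean RGB color), the five-component foreground and background GMMs and a per-segment data cost can be computed as follows. Negative log-likelihood is used as a stand-in for the article’s exact data term, and the pairwise smoothness term and max-flow solve are omitted:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_color_models(fg_colors, bg_colors, n_components=5):
    """Fit 5-component GMMs to marked foreground and nearby background segment colors."""
    fg_gmm = GaussianMixture(n_components=n_components, covariance_type="full").fit(fg_colors)
    bg_gmm = GaussianMixture(n_components=n_components, covariance_type="full").fit(bg_colors)
    return fg_gmm, bg_gmm

def data_costs(segment_colors, fg_gmm, bg_gmm):
    """Per-segment cost of each label; lower cost means a better fit to that color model."""
    cost_fg = -fg_gmm.score_samples(segment_colors)
    cost_bg = -bg_gmm.score_samples(segment_colors)
    return cost_fg, cost_bg

# Toy example: mean RGB colors of segments (rows = segments, values in [0, 1]).
rng = np.random.default_rng(0)
fg = rng.uniform(0.6, 1.0, size=(200, 3))          # colors sampled under user strokes
bg = rng.uniform(0.0, 0.4, size=(300, 3))          # colors from the surrounding bounding box
fg_gmm, bg_gmm = build_color_models(fg, bg)
c_fg, c_bg = data_costs(rng.uniform(0, 1, size=(10, 3)), fg_gmm, bg_gmm)
labels = np.where(c_fg < c_bg, "foreground", "background")   # unary decision only
```

In the full system, these unary costs would become the terminal weights of the per-segment graph, with the δ-weighted edge term and the user-marked hard constraints added before running max flow.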


Dense Depth from Structure-from-Motion Data

Techniques for extracting dense depth maps from multiple images through stereo1 require a static scene and are computationally expensive. In our case, there might be independently moving objects. We found that sparse structure from motion (SFM) data was suitable for our system, which assigns depth largely on a per-segment, rather than per-pixel, level. Incorporating sparse SFM data and sparse user-supplied data into the same framework is straightforward. We directly use sparse SFM data or user-supplied depths to propagate depth information across all the segments in a video. This lets us assign depths over regions that generate few SFM features. It also significantly reduces interaction time by letting users set depths for the complete scene without manually assigning depth to each region.

For segments with no corresponding depths, we must infer the depth from feature points generated by SFM or from user-supplied information. We wish the depths to be influenced by the supplied depths, and we wish to encourage the depth discontinuities to occur at color discontinuities. We formulated the problem of setting segments to depths as the maximum a posteriori (MAP) solution of a Markov random field (MRF) model, found using a volumetric graph cut. For each frame, we construct a graph, assigning each segment a node at each of a series of depth levels covering the feature points’ depth range. Each node is connected to the corresponding node in the neighboring depth levels and is connected in each level to its neighbors in the image. The cut labels nodes as belonging to either the world or empty space. We minimize

$$E(\alpha, \mathbf{x}, \mathbf{z}, f) = U(\alpha, \mathbf{x}) + \lambda V(\alpha, \mathbf{x}, \mathbf{z}, f),$$

where α is the set of node labels, x is the image color data, z is the set of node depth values, and f is the set of feature points. U encourages the cut to follow edges in the image, whereas V encourages the cut to follow the distribution of feature point depths. λ is a constant weighting between the two terms, set to 5. We define U as for region selection (see the section “Creating regions” in the main article), with M as the set of all neighboring node pairs in a depth level, and D as the set of neighboring pixels in nodes p and q. We define V as

$$V(\alpha, \mathbf{x}, \mathbf{z}, f) = \sum_{(p,q)\in N} \delta(\alpha_p, \alpha_q)\, k_{pq} \exp\left(-\frac{\bigl(\min(z_p, z_q) - \mu_{pq}\bigr)^2}{2\sigma_{pq}^2}\right),$$

where N is the set of all neighboring node pairs across depth levels, and k_pq is the number of pixels in the segment corresponding to a pair of nodes. The weighted mean μ_pq of feature depths f_d in a window around the segment is

$$\mu_{pq} = \frac{\sum_{i=1}^{n} w_i f_{d_i}}{\sum_{i=1}^{n} w_i}.$$

w_i, which gives a higher weight to nearby features with a color similar to that of the segment, is

$$w_i = \exp\left(-\frac{\|f_{x_i} - \mathbf{x}_{pq}\|^2}{2\beta}\right) \exp\left(-\gamma \frac{\|f_{y_i} - \mathbf{y}_{pq}\|^2}{s^2}\right) f_{c_i},$$

where f_xi, f_yi, and f_ci are the feature color, position, and confidence values, s is the window radius, and x_pq and y_pq are the segment color and centroid position. We set f_ci to give a heavier weight to user-specified features than SFM features. We set the window size to include at least five features, we define β as for region selection, and γ is a constant scaling factor, set to 5. σ²_pq, which is

$$\sigma_{pq}^2 = \frac{\sum_{i=1}^{n} w_i \left(f_{d_i} - \mu_{pq}\right)^2}{\sum_{i=1}^{n} w_i} + e,$$

reduces the cost of cuts far from μ_pq if the spread of depths in the window is wide. We add the constant e (set to 3) to also reduce the cost if the sum of feature weights is low (that is, if few features are in the window, features are far from the segment, or feature colors poorly correspond with segment color).

After generating depth for a frame, we encourage temporal consistency by propagating depths to the next frame. We do this by projecting segment centroids into the next frame and adding features at the corresponding image locations to the feature set. To avoid errors accumulating over the sequence, these propagated features receive a lower confidence value than points from SFM or interaction.
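The following sketch (ours, not from the article; it assumes NumPy and takes the window’s features as already gathered around the segment) computes the weighted statistics μ_pq and σ²_pq defined above. The negative signs in both exponents are assumed so that nearby, similarly colored features receive the higher weight the text describes:

```python
import numpy as np

def weighted_depth_stats(seg_color, seg_xy, feat_color, feat_xy, feat_depth, feat_conf,
                         s, beta, gamma=5.0, e=3.0):
    """Weighted mean and variance of feature depths around one segment (sketch only).
    seg_color, seg_xy: segment mean color and centroid; feat_*: per-feature color,
    image position, depth, and confidence arrays; s: window radius; beta, gamma, e:
    constants as described in the sidebar."""
    color_d2 = np.sum((feat_color - seg_color) ** 2, axis=1)           # color distance
    spatial_d2 = np.sum((feat_xy - seg_xy) ** 2, axis=1) / (s ** 2)    # normalized image distance
    w = np.exp(-color_d2 / (2.0 * beta)) * np.exp(-gamma * spatial_d2) * feat_conf
    mu = np.sum(w * feat_depth) / np.sum(w)                            # weighted mean depth
    sigma2 = np.sum(w * (feat_depth - mu) ** 2) / np.sum(w) + e        # weighted variance + e
    return mu, sigma2
```

These two values then parameterize the edge cost V across the segment’s depth levels.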

Reference

1. G. Zhang et al., “Recovering Consistent Video Depth Maps via Bundle Optimization,” Proc. 2008 IEEE Conf. Computer Vision and Pattern Recognition (CVPR 08), IEEE CS Press, 2008.

Assigning region depths. Depth Director uses points recovered from the SFM step to automatically set segment depths. (For details on how we accomplish this, see the “Dense Depth from Structure-from-Motion Data” sidebar.) Figure 5 shows some results from this technique.

If Depth Director can’t perform SFM owing to limited camera motion, users can manually assign depths to selected regions (thus creating a feature point per each affected segment). Depth Director then propagates depths from feature points in the same manner as if SFM data were available. Users can manually assign depths by dragging selected regions in the perspective view (see Figure 3b, right). This view assists in setting depth by indicating the relative depths of a scene’s elements. It shows both the camera locations and the frame’s boundary, indicating to the user whether regions with a given depth assignment will appear to be behind or in front of the image plane.

Manipulating depth contrast. The user can scale depths in a selected region by increasing or decreasing local disparity. The depth of the selected segment farthest from the camera is kept fixed, and dragging the selected region scales the remaining segment depths accordingly. By emphasizing depth in some regions and de-emphasizing it in others, users can draw the viewer’s focus to particular parts of the frame, enhancing the overall 3D effect. We call these types of operations depth contrast manipulation; this is analogous to manipulating image contrast to enhance it. Figure 6 shows results from increasing depth contrast to enhance the depth effect for a scene’s key element.
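A minimal sketch of the depth-contrast operation (ours, not the authors’ code; it assumes larger values mean farther from the camera) is shown below. The farthest depth in the selection stays fixed, and every other segment’s offset from it is scaled:

```python
import numpy as np

def scale_depth_contrast(depths, factor):
    """Scale a selection's depths about its farthest segment (depth contrast manipulation).
    factor > 1 exaggerates the selection's relief; factor < 1 flattens it."""
    depths = np.asarray(depths, dtype=float)
    farthest = depths.max()                       # anchor: segment farthest from the camera
    return farthest - factor * (farthest - depths)

# Example: boosting the contrast of a three-segment selection by 1.5x.
print(scale_depth_contrast([2.0, 3.0, 5.0], 1.5))   # approximately [0.5, 2.0, 5.0]
```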

Figure 5. Using structure-from-motion (SFM) data. (a) An image with detected points superimposed. (b) An oblique view of the scene after diffusion of SFM data. The user corrects an area where SFM fails by (c) selecting the regions representing the table’s edge, (d) flattening, and (e) dragging them.

Depth Templates

Directly setting depths in the perspective view assigns the selected region’s depths to a flat plane parallel to the image plane. With this technique, some complex elements of the scene will have a flat, “cardboard” appearance; convincingly assigning depths to elements of the scene such as the ground plane can be difficult. For increased detail or to simplify depth assignment, users can employ depth templates to set depths for segments in selected regions.

A depth template represents a predefined deformable shape; it’s a triangle mesh that users can translate, scale, and rotate to fit the selected region. To assign depth, Depth Director projects selected segments onto the mesh’s visible surface. Clicking a template button applies a new template to the selected region. The system initially positions the template and scales it to fit the selected region in each frame. Figure 7 shows a planar template being applied to a selected region.

Figure 6. Processing frames using depth contrast. (a) The original results. (b) The same frames with increased depth contrast on the main features. Note the elongation of the car from an oblique view.


Figure 7. Adding depth using depth templates. The affected region is highlighted in red. Depth templates simplify the 2D-to-3D conversion process.

Manipulating depth templates. Each template shape S(x, y, z) has parameters for translation, t = (tx, ty, tz); scale, a = (ax, ay, az); and orientation, R, corresponding to Euler angles (rx, ry, rz) about the x-, y-, and z-axes, respectively. The final shape is given by RS(ax x, ay y, az z) + t.

Users can modify template parameters by dragging handles aligned with the axes of the current template orientation. The transformation images in Figure 8 show the parameters modified by each handle. An additional handle adjusts the template’s depth contrast, scaling its depths in the camera’s direction. For objects moving through a scene, users can specify the template position and orientation for a set of keyframes, and the system will interpolate them over the sequence. Users modify tz by dragging the selected region outside the handles.
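A sketch of this parameterization applied to a template mesh (ours, for illustration; the rotation composition order Rz·Ry·Rx is an assumption, since the article only states that R corresponds to Euler angles about the x-, y-, and z-axes):

```python
import numpy as np

def apply_template(vertices, t, a, r):
    """Place a depth template: final shape = R S(ax x, ay y, az z) + t.
    vertices: (N, 3) template mesh points; t: translation; a: per-axis scales;
    r: Euler angles (rx, ry, rz) in radians."""
    rx, ry, rz = r
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(rx), -np.sin(rx)],
                   [0, np.sin(rx),  np.cos(rx)]])
    Ry = np.array([[ np.cos(ry), 0, np.sin(ry)],
                   [0, 1, 0],
                   [-np.sin(ry), 0, np.cos(ry)]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0],
                   [np.sin(rz),  np.cos(rz), 0],
                   [0, 0, 1]])
    R = Rz @ Ry @ Rx                                   # assumed composition order
    scaled = vertices * np.asarray(a, dtype=float)     # S(ax x, ay y, az z)
    return scaled @ R.T + np.asarray(t, dtype=float)   # rotate, then translate

# Example: stretch a unit-sphere template into an ellipsoid and push it away from the camera.
rng = np.random.default_rng(1)
sphere = rng.normal(size=(200, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
placed = apply_template(sphere, t=(0.0, 0.0, 4.0), a=(1.0, 1.5, 1.0), r=(0.0, 0.3, 0.0))
```

For keyframed templates, the same transform would simply be evaluated with interpolated (t, a, r) values per frame.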


Types of templates. Depth Director uses three basic types of templates: planar, simple, and complex.

Users can employ planar templates to create a ground plane or set vertical planes for walls. They can also use them for background elements that are far enough from the camera that a detailed depth assignment isn’t required. Depth Director provides separate templates for left vertical, right vertical, and ground planes. In our experiments, planes were the most commonly used template type, being applied in most shots. In Figure 8a, a user applied a vertical-plane template to a car door. On the left, the user rotated a plane template around the y-axis by adjusting the handle corresponding to ry.

Users can employ simple templates to construct the basic geometry of typical scenes and to enhance the depth effect for objects with a shape similar to one of the simple types. Users can create a variety of shapes by partially intersecting these templates with the selected region and applying multiple templates to the same region. Simple templates come in three types: sphere, cylinder, and box. For the sphere template, scale parameters correspond to an ellipsoid’s radii. In Figure 8b, left, the user elongated the template by adjusting ax. On the right, the user applied the sphere template to shape the man’s head.

In Figure 8c, the user employed a cylinder template to enhance the depth effect for an object with a more complex shape. On the left, the user shifted the template by dragging the base of the handles to adjust tx.

Users can employ the box template to apply depth to pairs of planes at right angles, a commonly occurring structure. In Figure 8d, the user employed this template to create the car’s basic shape. On the left, the user dragged a handle to increase the depth contrast.

Complex templates let users set more detailed depths for commonly occurring complex structures. Depth Director provides additional controls to adjust these templates’ low-level shape, if required. The UI divides complex templates into partitions, which users can individually scale using the interface handles.

The Depth Director prototype provides two types of complex templates: face and car (see Figures 8e and 8f). These templates are relatively generic, so users can apply them to a range of faces and cars. In our experiments, users achieved a convincing 3D effect by applying a template with a similar shape to a complex object’s true shape, even if that template didn’t accurately represent all the object’s details.

Automated template fitting. To reduce the time for aligning templates with images, Depth Director provides automatic fitting for some template types. For the planar and box templates, the system applies edge and line extraction to a frame to find a set of line segments in the selected region.

For planar templates, Depth Director can automatically select the plane type by computing vanishing points using horizontal and vertical line segments. The system uses these points’ positions with respect to the center of the selection to classify and initialize either a ground plane or a left or right vertical plane. The system then optimizes the plane’s position and orientation using the Levenberg-Marquardt algorithm (LM)3 to align the plane with the extracted edge segments. It does this by minimizing the smallest angle between each line segment projected onto the plane and one of the vectors defining the plane. For box templates, Depth Director uses LM to align line segments with the template’s nearest plane. Users can manually adjust and refine automatic-fitting results using the handles we described earlier.

Figure 8. Frames processed using depth templates for a (a) plane, (b) sphere, (c) cylinder, (d) box, (e) face, and (f) car. From left to right, the figure shows template transformations, base template shapes, results with uniform depths applied to a region, and results with template depths assigned.


If the user has set a ground plane, the system sets the initial distance from the camera for a newly applied depth template so that the selection’s base touches the ground.

For face templates, Depth Director can align keypoints on the mesh (corresponding to the eyes, nose, mouth, and face’s boundary) with those features’ location in the image. Figure 8e shows mesh keypoints and the associated image points. To locate features, the system uses Paul Viola and Michael Jones’ method.4 It first scans the selected region at multiple scales, detecting regions containing a given feature, using a boosted cascade of classifiers based on Haar-like features, trained on images of the corresponding facial region. We chose this method for its accuracy and real-time performance. Similarly to planar fitting, we optimize the face template’s position and orientation using LM to align keypoints on the model with the detected features. We do this by minimizing the 2D distance between the center of each feature and the associated keypoint’s position, projected onto the image.
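To illustrate the Levenberg-Marquardt alignment step (a sketch of ours, not the authors’ implementation), the snippet below fits only the template’s translation by minimizing the 2D distances between projected mesh keypoints and detected feature centers, using SciPy’s least_squares with the LM method. The pinhole projection and the intrinsics are assumptions; the article optimizes position and orientation jointly:

```python
import numpy as np
from scipy.optimize import least_squares

def project(points3d, f=1000.0, cx=640.0, cy=360.0):
    """Simple pinhole projection (assumed here for illustration)."""
    x, y, z = points3d[:, 0], points3d[:, 1], points3d[:, 2]
    return np.stack([f * x / z + cx, f * y / z + cy], axis=1)

def fit_template_translation(keypoints3d, detections2d):
    """Find the template translation whose projected keypoints best match the detections."""
    def residuals(t):
        return (project(keypoints3d + t) - detections2d).ravel()
    result = least_squares(residuals, x0=np.array([0.0, 0.0, 5.0]), method="lm")
    return result.x

# Synthetic check: recover a known offset from six face keypoints.
rng = np.random.default_rng(2)
keypoints = rng.uniform(-0.1, 0.1, size=(6, 3))          # eyes, nose, mouth, boundary points
true_t = np.array([0.05, -0.02, 4.0])
detections = project(keypoints + true_t)
print(fit_template_translation(keypoints, detections))    # approximately true_t
```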

Refining SFM Results

Sequences that are largely amenable to SFM reconstruction might include regions where reconstruction fails, owing to moving objects or lack of detail. We provide additional interactive techniques to improve depth assignment results in these regions.

Flattening regions. Regions with minimal texture (such as the front edge of the table in Figure 5) will have few associated feature points in an SFM reconstruction. Without information on these segments’ depths, the dense depth assignment process might assign to segments in these regions the depths of nearby regions with more detail, making regions disjoint that should have uniform depth. To encourage uniform depth in these regions, a “flatten” command places heavy weights on edges between selected segments, encouraging the graph cut to assign consistent depths over the selected region. Users can also set depths directly or with a planar template. Figure 5 shows an example of flattening a region and setting a new depth.

Deleting SFM points. In highly dynamic scenes, SFM might have difficulty separating camera movement from independently moving objects, causing reconstruction to fail or giving noisy results. In such cases, users can employ manual segmentation to mark dynamic regions and can exclude extracted features in those regions from SFM processing, to improve reconstruction performance.


Although this will help recover camera parameters, no features will be available for the excluded regions. As with regions with minimal texture, and other cases in which SFM results are sparse or noisy, user interaction can improve automatic depth assignment’s results.

Pixel-Level Refinement

Depth Director provides additional operations to adjust depths on the pixel level, if required.

Smoothing. As the system performs selection and depth assignment on the segment level, it will assign each pixel in a segment the same depth. In regions that should appear smooth, this might create noticeable seams at boundaries between segments. For continuous regions, matting isn’t applicable, but users can remove such seams by applying Gaussian smoothing to pixel depths in a selected region.

Rounding. The cardboard appearance we described earlier can be particularly noticeable with static objects, for which the depth effect isn’t aided by the monocular depth cues that motion provides. In these cases, if an appropriate depth template isn’t available, users can improve the depth effect by giving these objects a rounded, “popped out” shape. We generate these shapes by dilating the region’s selection boundary to select an interior region, reducing this interior region’s depth, and smoothly interpolating depths between the interior and exterior boundaries. In practice, the system generated pleasing results by

■ dilating 30 percent of the minimum dimension of a bounding box around the selected region,
■ reducing the interior depth by a user-specified distance, and
■ setting depths in the boundary to a + (1 − x³)(b − a), where a is the exterior depth and b is the interior depth. x ∈ [0, 1] is the normalized distance of a pixel to the interior boundary, such that x = 1 if the pixel is right at the exterior boundary.

Although the generated 3D structure is simple, combining a simple 3D shape with other depth cues such as shading can give the rounded region a convincing 3D appearance. This technique is applicable to a variety of objects. Figure 9 shows a series of frames generated with and without applying rounding.
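A sketch of the rounding operation on a per-pixel depth map (ours, not the authors’ code; it assumes SciPy, takes the exterior depth as the region’s current mean depth, and constructs the interior by erosion, all of which are simplifications). The cubic blend a + (1 − x³)(b − a) follows the formula above:

```python
import numpy as np
from scipy import ndimage

def round_region(depth, mask, pop_out):
    """Give a selected region a rounded, 'popped out' shape.
    depth: per-pixel depth map; mask: boolean selection; pop_out: user-specified
    amount by which the interior is pulled toward the camera."""
    height = int(np.any(mask, axis=1).sum())
    width = int(np.any(mask, axis=0).sum())
    r = max(int(0.3 * min(height, width)), 1)          # 30% of the bounding box's min dimension
    interior = ndimage.binary_erosion(mask, iterations=r)

    a = float(depth[mask].mean())                      # exterior depth (simplifying assumption)
    b = a - pop_out                                    # reduced interior depth

    # Normalized distance x: 1 at the exterior boundary, 0 at the interior boundary.
    d_ext = ndimage.distance_transform_edt(mask)       # distance to outside the selection
    d_int = ndimage.distance_transform_edt(~interior)  # distance to the interior region
    x = d_int / (d_int + d_ext + 1e-6)

    out = depth.copy()
    out[interior] = b
    band = mask & ~interior
    out[band] = a + (1.0 - x[band] ** 3) * (b - a)     # cubic falloff from the article
    return out
```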

Setting Rendering Parameters

Rules and guidelines exist for setting up a real-world two-camera 3D capture rig to achieve the correct look and results that are pleasing to the viewer.5 Likewise, we provide the appropriate controls with ranges that guide users in creating footage with these properties.

First, users can adjust the disparity (the horizontal separation between the left and right virtual cameras) with the slider controls beneath the video view. They can also modify the relative distance from the camera to the scene. As users adjust these variables, Depth Director dynamically updates the focal length to maintain a viewing angle that will keep the whole scene in view. Furthermore, to help users better visualize these camera parameters, the system renders the cameras and their frustums in the perspective view as wireframes. The UI displays virtual cameras to show these changes’ effects.
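The article doesn’t give its exact update rule, but the idea of adjusting the focal length to hold the framing can be illustrated with a simple pinhole relationship (a sketch of ours, with an assumed sensor half-width):

```python
def focal_to_keep_in_view(scene_half_width, camera_distance, sensor_half_width=0.5):
    """Smallest focal length whose horizontal field of view still covers the scene.
    For a pinhole camera, tan(half FOV) = sensor_half_width / f, and the scene subtends
    atan(scene_half_width / camera_distance); equating the two gives f (illustrative only)."""
    return sensor_half_width * camera_distance / scene_half_width

# As the user pulls the virtual rig back, the focal length grows to hold the framing.
for d in (4.0, 6.0, 8.0):
    print(d, focal_to_keep_in_view(scene_half_width=2.0, camera_distance=d))
```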

Rendering

Rendering the video as an unbroken mesh leads to visual artifacts at large depth discontinuities, in which edges can appear smeared or broken. Depth Director avoids this by breaking the mesh at these discontinuities and adding a matte boundary to the foreground region, to smoothly blend between foreground and background colors. The system identifies pixels near depth discontinuities as boundary pixels, to which it applies Bayesian matting.6 For each pixel in the boundary region, Bayesian matting builds Gaussian models of foreground and background color and simultaneously optimizes for foreground color, background color, and opacity values using a maximum-likelihood approach. To avoid visible holes in the background, the system inpaints background color values in the matte region’s interior by interpolating nearby background colors, which we found to give acceptable results.
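As one way to identify the boundary pixels described above (a sketch of ours; the gradient threshold and band width are illustrative, and the Bayesian matting solve itself is not shown), large depth jumps can be detected and dilated into an unknown band for matting:

```python
import numpy as np
from scipy import ndimage

def matting_boundary(depth, jump_threshold, band_width=5):
    """Flag pixels near large depth discontinuities as the unknown band for matting.
    jump_threshold: minimum local depth difference treated as a discontinuity;
    band_width: how far, in pixels, the band extends around each discontinuity."""
    gy, gx = np.gradient(depth.astype(float))
    discontinuity = np.hypot(gx, gy) > jump_threshold      # large local depth jumps
    return ndimage.binary_dilation(discontinuity, iterations=band_width)
```

Inside the returned band, foreground color, background color, and opacity would be estimated (for example, with Bayesian matting); elsewhere the broken mesh is rendered directly.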

Results

We tested whether Depth Director

■ lets users quickly convert 2D footage to 3D,
■ handles a variety of video inputs, including those challenging to traditional computer vision algorithms, and
■ provides a level of control such that users can create visually plausible 3D interpretations.

Tables 1 and 2 list the video clips we processed; Figure 10 shows representative processed frames. We categorized the clips as “SFM friendly,” “semi SFM friendly,” and “SFM unfriendly”; the workflow differed on the basis of the category. We generated the results on a PC with an Intel 2.66-GHz Core 2 Quad CPU, 2 Gbytes of RAM, and a GeForce 9800 GT graphics card.

Figure 9. Processing frames using rounding. (a) The original results. (b) The same frames with rounding applied to the main features. Rounding is a simple but effective way to add depth to complex objects.

SFM-Friendly Video

We start with examples involving static scenes with moving cameras, from which SFM information is subsequently available. The analysis stage’s initial results showed convincing 3D, so the user had to only make small corrections and apply smoothing to selected regions. Specifically, the user had to add depth to areas with low feature density, correct some depths at edges, and run the smoothing operation.


Table 1. Results for processing single-shot movie clips.*

Clip       No. of   Type of clip        Processing    Interaction   Rendering     Average interaction
           frames                       time (min.)   time (min.)   time (min.)   time (min.)†
Indoor1    30       SFM friendly        42            5             2             5
Tree       85       SFM friendly        130           0             5             0
Pool       100      Semi SFM friendly   132           8             6             2
Outdoor1   56       Semi SFM friendly   110           8             5             4
Man        55       SFM unfriendly      67            6             4             3
Running    35       SFM unfriendly      35            10            4             9
Car        47       SFM unfriendly      43            14            6             9
Chevy      56       SFM unfriendly      63            13            6             7
Ping Pong  56       SFM unfriendly      108           12            6             6

* Figure 10 shows representative frames for these clips.
† Per second of footage (30 frames).

Table 2. Results for processing longer clips with multiple shots.*

Clip        No. of   No. of   Processing    Interaction   Rendering     Average interaction
            frames   shots    time (hrs.)   time (hrs.)   time (hrs.)   time (min.)†
Outdoor2    1,356    17       24.6          10.1          1.9           13
Outdoor3    705      10       12.8          6.6           0.8           17
Indoor2     915      5        21.6          4.9           1.1           10
CarInside   1,071    11       26.3          6.7           1.8           11
CarOutside  994      7        23.6          3.7           1.8           7

* Figure 10 shows representative frames for these clips.
† Per second of footage (30 frames).

Semi-SFM-Friendly Video

When the input video contained dynamic scene elements (again, with a moving camera), most results exhibited correct 3D for the static scene elements. As we expected, the user had to add a popped-out appearance to dynamic elements, which, without SFM points, tend to smear across the scene’s background.

SFM-Unfriendly Video

Here, the sequences don’t have characteristics amenable to SFM, so we didn’t begin with depth estimates. These sequences include single-shot clips and longer clips with multiple shots.

Single-shot clips. For a series of single-shot clips, the user obtained convincing 3D effects for a shot in approximately 11 minutes for 2-second clips. The amount of required interaction varied. In one case, the user merely set the background and foreground depths and added depths to foreground objects using depth templates. In another, the user created the ground plane, set the depths of moving regions over the sequence, and set the background depth in various regions to seed the final result.


In the most complex case, the user built basic scene structures with plane templates, set depths for major moving objects, adjusted these depths over the sequence, and used depth templates for buildings, vehicles, and faces. Most interaction time involved selecting objects and setting initial depths.

Multiple-shot clips. On a series of longer clips featuring multiple shots, user interaction time varied between 7 and 17 minutes per second of footage (see Table 2). The amount of required interaction varied with the scenes’ complexity, with more complex shots featuring multiple moving objects, significant camera motion, and a similar appearance between foreground and background objects. For these sequences, we kept the disparity between individual shots relatively consistent to prevent jarring transitions.

Discussion and Future Work

As we mentioned before, the correct 3D results might not be as desirable in an artistic medium such as film. This issue being subjective, we chose to maximize user flexibility during the directing stage to accomplish as realistic or unrealistic results as desired. It seems possible to create highly convincing 3D effects even with reasonably crude 3D structures.

Generally, recovering depth automatically from 2D video is extremely challenging. However, in our restricted case of recovering depth for 3D movies, relatively coarse depths might be sufficient if we can achieve a convincing depth effect in combination with other perceptual depth cues (for example, perspective, shading, or motion parallax).

We could extend Depth Director in several ways. Besides improving the automatic preprocessing steps of segmentation and SFM, we could enhance the UI. For example, in the directing UI, we envision further depth contrast tools that automatically manipulate relative depths for specified regions or reduce the number of planes in space. This would accentuate the 3D effect at depth discontinuities, rather than having visuals full of subtle variations. Adding time-based control to the camera parameters could let users create effects similar to a dolly zoom or let them smoothly transition between shots.

The system is highly modular, and as such isn’t tied to any particular implementation of its underlying elements. So, it could easily exploit new developments that would improve any specific component’s performance. For example, we could enhance final results by extending the system’s rendering component. Whereas Depth Director currently performs matting on each frame individually, we could improve the results for moving objects by encouraging temporal consistency. We could enhance the results at large depth disparities with a more sophisticated infilling method to synthesize occluded regions’ content.

Adding higher-level understanding of the scene (for example, knowing which parts of the scene are sky and which are foliage) would enable easier depth augmentation for users. Another interesting direction is to better understand how accurate left-right disparity must be to “sell” the effect of depth on a flat screen. Much of the lore of stereographic-photography best practices5 is based on heuristics and could benefit from further scientific study.


As 3D movie production matures, an automated means for evaluating stereoscopic view synthesis is necessary. However, we don’t know whether you can automatically predict visual quality (for example, for preventing eyestrain). Enough variability exists in how humans perceive 3D that we don’t think you can effectively compare results without some form of user study. The emphasis should perhaps be on how to generate a small but representative set of views for evaluation, to minimize the subjects’ effort.

Figure 10. Representative processed frames from various 2D movie clips: (a) Indoor1, (b) Tree, (c) Outdoor1, (d) Pool, (e) Man, (f) Running, (g) Car, (h) Chevy, (i) Ping Pong, (j) Outdoor2, (k) Outdoor3, (l) Indoor2, (m) CarInside, and (n) CarOutside. These frames are in anaglyph format; you might need to magnify them to perceive 3D. Tables 1 and 2 give the clips’ details.


Acknowledgments

We thank the anonymous reviewers for their thoughtful comments that enabled us to improve this article.

References

1. C.L. Zitnick, N.A. Jojic, and S.B. Kang, “Consistent Segmentation for Optical Flow Estimation,” Proc. 10th IEEE Int’l Conf. Computer Vision, vol. 2, IEEE Press, 2005, pp. 1308–1315.
2. C. Rother, V. Kolmogorov, and A. Blake, “GrabCut: Interactive Foreground Extraction Using Iterated Graph Cuts,” ACM Trans. Graphics, vol. 23, no. 3, 2004, pp. 309–314.
3. W.H. Press et al., Numerical Recipes in C: The Art of Scientific Computing, Cambridge Univ. Press, 1992.
4. P. Viola and M. Jones, “Rapid Object Detection Using a Boosted Cascade of Simple Features,” Proc. 2001 IEEE Conf. Computer Vision and Pattern Recognition (CVPR 01), vol. 1, IEEE CS Press, 2001, pp. 511–518.
5. L. Lipton, Foundation of the Stereoscopic Cinema, Van Nostrand Reinhold, 1982.
6. Y.Y. Chuang et al., “A Bayesian Approach to Digital Matting,” Proc. 2001 IEEE Conf. Computer Vision and Pattern Recognition (CVPR 01), vol. 2, IEEE CS Press, 2001, p. 264.

Ben Ward is a researcher and PhD student in the University of Adelaide’s Australian Center for Visual Technologies. His research focuses on interactive 3D reconstruction from video. Ward has an honours degree in computer science from the University of Adelaide. Contact him at [email protected].

Sing Bing Kang is a principal researcher at Microsoft, working on image and video enhancement as well as image-based modeling. Kang has a PhD in robotics from Carnegie Mellon University. Contact him at [email protected].

Eric P. Bennett is a senior program manager at Microsoft, working on augmented reality. Bennett has a PhD in computer science from the University of North Carolina at Chapel Hill. Contact him at [email protected].
