In this blog post, I will review the paper Cost Volume Pyramid Based Depth Inference for Multi-View Stereo by Jiayu Yang et al., published in CVPR 2020. After introducing the topic and relevant background knowledge, I will explain the method in my own words. Then we will discuss the results and future work. You can also download my presentation slides of the paper and view them locally in PowerPoint.
==Multi-view stereo (MVS) aims to reconstruct the 3D model of a scene from a set of images captured by a camera from multiple viewpoints.== It is a fundamental problem in computer vision and has applications in 3D reconstruction and virtual reality.
This paper addresses the MVS problem by depth inference, i.e. by inferring the depth map for an image using its neighboring images. The 3D point cloud of the scene can be built directly from the estimated depth maps. We refer to the image of interest as the reference image and its neighboring images as source images. Another important thing to notice is that ==the camera poses (rotation, translation) and intrinsics of each viewpoint are known==. This is the difference between the MVS problem and SLAM or Structure from Motion, in which the camera poses and the 3D model of the scene are jointly estimated.
Given the reference image and its neighboring source images, depth inference for MVS aims to infer the depth map for the reference image.
Some background concepts need to be introduced before we dive into the paper. In this section, camera projection, epipolar line, photometric consistency and cost volume will be introduced and explained. The first three might be familiar to you if you have some experience in computer vision, while cost volume is a specific and relatively new concept.
For a 3D point $(X,Y,Z)$ in the world frame, its corresponding pixel position $(u,v)$ on the image plane is given as follows:
$$
\lambda (u,v,1)^T = K \, [R \,|\, t] \, (X,Y,Z,1)^T
$$
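where $K$ is the camera intrinsic matrix, $R$ and $t$ are the rotation and translation from the world frame to the camera frame, and $\lambda$ is the depth of the point in the camera frame.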
The rotation and translation transform the coordinates of the 3D point from the world frame into the camera frame, and the intrinsic matrix further transforms the 3D point from the camera frame onto the image plane.
Note that the preimage of a pixel on the image plane is a line in 3D space. If we want to transform a pixel on the image plane back to the world frame with no knowledge of the depth, we only know that the corresponding 3D point lies somewhere on a ray (the preimage of the pixel). We cannot tell where on the ray it is without knowing the depth.
With the camera intrinsics and camera poses of each view, we can transform between different views and easily reproject the pixel into other views.
If the depth of a pixel is unknown, the reprojection of a pixel from viewpoint 1 into viewpoint 2 lies on a line called the epipolar line. This follows directly from the previous observation: the preimage of the pixel in viewpoint 1 is a line in 3D space, and the projection of this 3D line into viewpoint 2 is also a line.
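To make the projection and the epipolar line concrete, here is a minimal NumPy sketch. The intrinsics, the two camera poses and the helper names (`project`, `backproject`, `reproject`) are made up for illustration and do not come from the paper. It lifts one pixel of view 1 to 3D for several assumed depths and projects each 3D point into view 2.

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D world point X to a pixel (u, v) using lambda*(u,v,1)^T = K(RX + t)."""
    uvw = K @ (R @ X + t)
    return uvw[:2] / uvw[2]

def backproject(K, R, t, uv, depth):
    """Lift a pixel with a given depth back to a 3D point in the world frame."""
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    x_cam = depth * ray              # point in the camera frame
    return R.T @ (x_cam - t)         # invert x_cam = R X + t

def reproject(K1, R1, t1, K2, R2, t2, uv, depth):
    """Reproject a pixel of view 1 into view 2 for an assumed depth."""
    return project(K2, R2, t2, backproject(K1, R1, t1, uv, depth))

# Hypothetical setup: identical intrinsics, view 2 shifted by 0.2 along +x.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R1, t1 = np.eye(3), np.zeros(3)
R2, t2 = np.eye(3), np.array([-0.2, 0.0, 0.0])

uv = np.array([400.0, 300.0])
depths = (1.0, 2.0, 4.0, 8.0)
points = np.array([reproject(K, R1, t1, K, R2, t2, uv, d) for d in depths])
print(points)  # every row has the same v: the points lie on one (horizontal) line
```

Because the second camera is only translated along the x axis in this toy setup, the epipolar line is horizontal. For a general relative pose the reprojections still fall on a single line, but it is no longer axis-aligned.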
Photometric consistency is a commonly used constraint in computer vision which assumes that the same 3D point projected into different viewpoints should have a similar color. Under large lighting changes or on non-Lambertian surfaces, this constraint might not hold. But in general, photometric consistency holds for most pixels in the image. With photometric consistency, the depth of a pixel can be estimated by minimizing the reprojection error. For example, in the above figure, we want to find the reprojection of the blue pixel in the right view. The depth of the blue pixel is unknown, so we assume four depth hypotheses. Each depth hypothesis gives a possible reprojection of the blue pixel in the right view. For each depth hypothesis, we compute the reprojection error (the difference between the original pixel value and the reprojected pixel value). The best depth hypothesis is the one which gives the smallest reprojection error. We can sample more depth hypotheses to get a more accurate depth estimate.
Of the four depth hypotheses, the green one results in the smallest reprojection error and will be chosen.
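The hypothesis test described above can be sketched in a few lines of NumPy. Everything in this example is an assumption made for illustration: the camera setup, the helper names, and in particular `source_intensity`, which stands in for a real source image with a brightness that grows linearly with the horizontal pixel coordinate.

```python
import numpy as np

def reproject(K, R, t, uv, depth):
    """Reproject a reference pixel (reference camera at the world origin) into the source view."""
    X = depth * (np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0]))
    uvw = K @ (R @ X + t)
    return uvw[:2] / uvw[2]

def source_intensity(uv):
    """Hypothetical source image: brightness is a linear ramp in u."""
    return 0.01 * uv[0]

def best_depth(K, R, t, uv, ref_value, hypotheses):
    """Return the hypothesis with the smallest reprojection error, plus all errors."""
    errors = np.array([
        abs(source_intensity(reproject(K, R, t, uv, d)) - ref_value)
        for d in hypotheses
    ])
    return hypotheses[int(np.argmin(errors))], errors

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([-0.2, 0.0, 0.0])   # source camera shifted along +x
uv = np.array([400.0, 300.0])

# Simulate photometric consistency: the reference pixel has the color that
# the source image shows at the reprojection for the true depth (2.0).
ref_value = source_intensity(reproject(K, R, t, uv, 2.0))

hypotheses = np.array([1.0, 2.0, 4.0, 8.0])
best, errors = best_depth(K, R, t, uv, ref_value, hypotheses)
print(best, errors)
```

With a real image the error would rarely be exactly zero, and a single pixel is ambiguous in low-texture regions, which is one reason the learning-based methods discussed later compare features instead of raw colors.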
The cost volume is the background concept most specific to this paper, and you will probably only have come across it if you have read some papers about MVS.
The construction of cost volume
Suppose we have two source views, as shown above. Recall that the reference view is the view of interest, whose depth map we want to estimate, and the source views are neighboring views of the reference view. The image dimension is
Mathematically, given a reference view
While traditional methods from before the deep learning era achieve good reconstructions of scenes with Lambertian surfaces, they still suffer from illumination changes, low-texture regions, and reflections, all of which result in unreliable matching correspondences for further reconstruction.
Recent learning-based approaches adopt deep CNNs to infer the depth map for each view, followed by a separate multi-view fusion process for building 3D models. These methods allow the network to extract discriminative features encoding global and local information of a scene to obtain robust feature matching for MVS.
In particular, Yao et al. propose MVSNet to infer a depth map for each view. An essential step in MVSNet is to build a cost volume based on a plane sweep process, followed by multiscale 3D CNNs for regularization. While effective in depth inference accuracy, its memory requirement grows cubically with the image resolution. To handle high-resolution images, they then adopt a recurrent cost volume regularization process (R-MVSNet). However, the reduction in memory requirements comes at the cost of a longer run-time.
Left: MVSNet (Yao et al. 2018). Right: R-MVSNet (Yao et al. 2019).
In order to achieve a computationally efficient network, Point-MVSNet (Chen et al. 2019) works on 3D point clouds to iteratively predict the depth residual along visual rays, using edge convolutions operating on the k nearest neighbors of each 3D point. While this approach is efficient, its run-time increases almost linearly with the number of iteration levels.
Point-MVSNet(Chen et al. 2019)
The key novelty of the presented method (CVP-MVSNet) is that it builds a cost volume pyramid in a coarse-to-fine manner instead of constructing a cost volume at a fixed resolution. This leads to a compact, lightweight network and allows inferring high-resolution depth maps, which in turn gives better reconstruction results.

# Method

In this part, I will explain the methodology of the presented paper (CVP-MVSNet). First I will state the problem formally. Then each part of the method will be explained separately. Finally an overview will be given.

## Problem statement

Denote the reference image as $I_0 \in \mathbb{R}^{H \times W}$, where $H$ and $W$ define its dimensions. Let $\{I_i\}_{i=1}^N$ be its $N$ neighboring source images. The corresponding camera intrinsics, rotation matrices, and translation vectors for all views $\{K_i, R_i, t_i\}_{i=0}^N$ are known. The goal is to infer the depth map $D$ for $I_0$ from $\{I_i\}_{i=1}^N$.

The feature extraction pipeline consists of two steps. First a
The resulting feature map at level
In the introduction part, the cost volume was built directly on images. In practice, building the cost volume on learnable features is more robust against illumination changes. Also, previously I only mentioned that the depth hypotheses could be sampled uniformly in the depth range
Mathematically, the iterative refinement step can be formulated as follows. Assume we have the depth estimate
The resulting cost volume at level
Given the constructed cost volume at level
The resulting depth map is simply a weighted sum of the depth hypotheses, where the weights are the corresponding channels of the probability volume.
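The weighted sum can be written as a single tensor contraction. The sketch below is only an illustration under stated assumptions: the probability volume is stored as an array of shape `(M, H, W)`, and it is obtained here by a plain softmax over made-up scores, whereas the real network produces it with learned layers. The names `softmax` and `expected_depth` are mine.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the depth-hypothesis axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def expected_depth(prob_volume, hypotheses):
    """Per-pixel expectation of the depth.

    prob_volume: (M, H, W), each pixel's M probabilities sum to 1.
    hypotheses:  (M,) depth value of each channel.
    returns:     (H, W) depth map.
    """
    return np.tensordot(hypotheses, prob_volume, axes=(0, 0))

hypotheses = np.array([1.0, 2.0, 3.0, 4.0])

# Uniform probabilities: the estimate is the mean of the hypotheses.
uniform = softmax(np.zeros((4, 2, 3)))
depth_uniform = expected_depth(uniform, hypotheses)

# Scores that strongly favor the second hypothesis (depth 2.0).
scores = np.zeros((4, 2, 3))
scores[1] = 10.0
depth_peaked = expected_depth(softmax(scores), hypotheses)
print(depth_uniform[0, 0], depth_peaked[0, 0])
```

Taking the expectation instead of the arg max keeps the operation differentiable and lets the estimate fall between two sampled hypotheses.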
At the coarsest level
For lower levels of the pyramid, let $r_p=m\Delta d_p^l$ denote the depth residual hypothesis. The depth estimate is then given by:
$$
D^l(p)=D^{l+1}_{\uparrow}(p)+\sum_{m=-M/2}^{M/2-1}r_pP_p^l(r_p)
$$
where
The loss function of the network is simply a
The entire network structure is shown below. Reference and source images are first down-sampled to form an image pyramid. We apply a feature extraction network to all levels and images to extract feature maps. We then build the cost volume pyramid in a coarse-to-fine manner. Specifically, we start with the construction of a cost volume corresponding to the coarsest image resolution, and then iteratively build partial cost volumes for depth residual estimation to obtain the depth map for the reference image.
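One step of the coarse-to-fine refinement can be sketched as follows, assuming the residual probability volume for the finer level is already available. The function names are hypothetical, and nearest-neighbour upsampling is used only to keep the example short; it is not a claim about the interpolation used in the paper.

```python
import numpy as np

def upsample2x(depth):
    """Nearest-neighbour 2x upsampling of a (H, W) depth map (a simplification)."""
    return np.repeat(np.repeat(depth, 2, axis=0), 2, axis=1)

def refine_depth(coarse_depth, residual_prob, delta_d):
    """Add the expected depth residual to the upsampled coarse depth.

    coarse_depth:  (H, W) depth map of the coarser level.
    residual_prob: (M, 2H, 2W) probabilities of the residuals m * delta_d,
                   for m = -M/2, ..., M/2 - 1 (M is assumed to be even).
    delta_d:       scalar or (2H, 2W) depth interval of the finer level.
    """
    M = residual_prob.shape[0]
    m = np.arange(-(M // 2), M // 2).reshape(M, 1, 1)
    residuals = m * delta_d                        # r_p = m * delta_d
    expected_residual = (residuals * residual_prob).sum(axis=0)
    return upsample2x(coarse_depth) + expected_residual

coarse = np.array([[2.0]])                         # 1x1 depth map at the coarse level
prob = np.zeros((4, 2, 2))
prob[3] = 1.0                                      # all mass on m = +1
fine = refine_depth(coarse, prob, delta_d=0.1)
print(fine)
```

Since only a small number of residual hypotheses around the current estimate is evaluated at each finer level, the partial cost volumes stay small even at high image resolution.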
**DTU Dataset** is used for training and testing. It includes table-top objects under laboratory lighting conditions.
DTU Dataset
**Tanks and Temples Dataset** is only used for testing. It includes indoor and outdoor scenes under realistic lighting conditions.
Tanks and Temples
Accuracy, completeness and overall score are used to evaluate the quality of reconstructed point clouds.
Denote the ground truth model as $G$ and the reconstructed model as $R$.
- Accuracy is the distance from $R$ to $G$;
- Completeness is the distance from $G$ to $R$;
- Overall score is the average of accuracy and completeness.
The names of the metrics are largely self-explanatory. If only accuracy were reported, it would favor algorithms that only include estimated points of high certainty, e.g. on highly textured surface parts. On the other hand, if only completeness were reported, it would favor algorithms that include everything, regardless of point quality.
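The trade-off between the two metrics is easy to reproduce on a toy example. The sketch below treats both models as small point sets and uses the mean nearest-neighbour distance as "the distance from one set to the other". This is my simplification for illustration, not the official evaluation protocol of the dataset, and the function names are hypothetical.

```python
import numpy as np

def mean_nn_distance(A, B):
    """Mean distance from each point in A (n, 3) to its nearest point in B (m, 3)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # (n, m) pairwise distances
    return d.min(axis=1).mean()

def evaluate(R, G):
    """Accuracy (R to G), completeness (G to R) and their average; lower is better."""
    accuracy = mean_nn_distance(R, G)
    completeness = mean_nn_distance(G, R)
    return accuracy, completeness, 0.5 * (accuracy + completeness)

# Ground truth with two points; the reconstruction only recovers one of them.
G = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
R = np.array([[0.0, 0.0, 0.0]])

accuracy, completeness, overall = evaluate(R, G)
print(accuracy, completeness, overall)
```

The single reconstructed point is perfectly accurate, yet half of the ground truth is missing, so only the completeness (and therefore the overall score) reveals the problem. The brute-force pairwise distance matrix is fine for a toy example but would need a spatial index for real point clouds.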
Results on DTU test set. The upper row shows the point clouds and the bottom row shows the normal map corresponding to the orange rectangle. As highlighted in the blue rectangle, the completeness of the proposed method is better than Point-MVSNet. The normal map (orange rectangle) further shows that the proposed method is smoother on surfaces while maintaining more high-frequency details.
Point cloud reconstruction on the Tanks and Temples dataset. Note that the model was not trained or fine-tuned on this dataset. This result shows that the presented method generalizes well.
Intermediate point cloud results. Note that the reconstruction quality improves with every iteration of depth residual refinement.
Quantitative results of reconstruction quality on the DTU dataset (lower is better). The presented method outperforms all other methods on completeness and overall reconstruction quality and achieves the second best accuracy.
Comparison of reconstruction quality, GPU memory usage and runtime on the DTU dataset for different input sizes. For the same size of depth maps, the proposed method performs similarly to Point-MVSNet, while being 6 times faster and consuming 6 times less GPU memory. For the same size of input images, the proposed method achieves the best reconstruction in the shortest time with a reasonable GPU memory usage.
The main contribution of this paper is building a cost volume pyramid in a coarse-to-fine manner. To be honest, I think the methodology does not have much novelty, since the coarse-to-fine pyramid structure is a common approach (e.g. in optical flow, motion estimation and frame interpolation) to increase speed and reduce memory requirements. If you have read the MVSNet paper, you will find that the method in this paper is an incremental improvement of MVSNet.
However, the devil is in the details, as the saying goes. Although the basic idea of this paper is straightforward, a lot of implementation details have to be determined carefully to achieve state-of-the-art results. For example, the choice of depth hypotheses number
One possible improvement would be to jointly estimate the depth maps for both the reference image and the source images, and output the merged 3D point cloud directly. In this paper, the output is only a depth map for a single image. In some cases, we might want the 3D model of a scene, and combining different viewpoints could make the reconstruction more complete.