patrick-llgc / Learning-Deep-Learning

Paper reading notes on Deep Learning and Machine Learning

Regarding review of Learning Joint 2D-3D Representations for Depth Completion

sparshgarg23 opened this issue

Hi, I have written a review of the paper mentioned in the subject line. I would like to hear your opinion on the paper as well as on my review.
Enclosed is the review.
Joint 2D-3D representation for depth completion

1. Depth estimation has been surveyed extensively. This paper extracts features from both the 2D (camera) and 3D (LiDAR) domains and fuses them to produce a sharper, cleaner depth estimate.
2. Compared with existing traditional 3D depth estimation approaches, what sets this paper apart is its ability to learn a better representation without relying heavily on complex data and labels.
3. The key defining feature is the use of 2D-3D convolutional blocks, which let the network learn in two separate feature domains. The first branch learns 2D features using standard convolutional layers; the second learns 3D features using continuous convolutional neural networks.
4. Note that both continuous convolutional nets and traditional CNNs represent the output as a weighted sum of the neighboring features. However, whereas CNNs assume the data can be represented as a grid (which makes finding neighbors easy), the same assumption does not hold for point cloud data.
5. Because LiDAR data is sparse, the first step is a k-nearest-neighbor search to determine each point's neighbors. Those k nearest neighbors, along with the input features, are then fed to a multi-layer perceptron to produce the kernel parameters. Finally, this kernel is convolved with the input features to get the result.
6. The 2D and 3D features are then concatenated to produce the final output. To keep the output feature dimension consistent with the input dimension, a final convolution is applied. Skip connections between the input and output are also added to ease training.
7. Ablation studies show that, when evaluated by RMSE, the approach achieves satisfactory results.
8. However, certain questions still need to be answered; some of the prominent ones are as follows:
    1. Many autonomous driving applications require the camera and LiDAR to be articulated. Will the architecture estimate depth with the same accuracy in the presence of varying camera roll and pitch angles?
    2. Recent studies show that ReLU networks suffer from several issues. For example, the expressive power of deep learning rests on being able to approximate any nonlinear function given a deep enough network, but in practice the number of activation patterns a ReLU network actually learns falls short of the theoretical bound.
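
To make points 4 and 5 concrete, here is a minimal NumPy sketch of one continuous-convolution layer: find each point's k nearest neighbors, predict per-neighbor kernel weights with a small MLP, and take the weighted sum. This is not the authors' implementation; the MLP here sees only the relative 3D offsets (a common simplification), and all shapes and the random weight initialization are illustrative assumptions.

```python
import numpy as np

def continuous_conv(xyz, feats, k, rng):
    """One continuous-convolution layer over a sparse point set.

    xyz:   (N, 3) point coordinates
    feats: (N, C) per-point input features
    Returns (N, C): for each point, a weighted sum over its k nearest
    neighbors, with weights predicted from relative offsets by a small
    MLP (randomly initialized here purely for illustration).
    """
    n, c = feats.shape
    # 1. k nearest neighbors by squared Euclidean distance
    d2 = ((xyz[:, None, :] - xyz[None, :, :]) ** 2).sum(-1)   # (N, N)
    idx = np.argsort(d2, axis=1)[:, :k]                       # (N, k)
    # 2. MLP maps relative offsets to one weight per neighbor and channel
    offsets = xyz[idx] - xyz[:, None, :]                      # (N, k, 3)
    w1 = rng.standard_normal((3, 16))
    w2 = rng.standard_normal((16, c))
    h = np.maximum(offsets @ w1, 0.0)                         # ReLU hidden layer
    w = h @ w2                                                # (N, k, C) kernel
    # 3. "convolve": weighted sum of the neighbors' features
    return (w * feats[idx]).sum(axis=1)                       # (N, C)
```

The output keeps the per-point feature layout, so the layer can be stacked like an ordinary convolution even though the points lie on no grid.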
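
The fusion step in point 6 can be sketched the same way: concatenate the two branches, apply a final 1x1 convolution (a per-pixel linear map) to restore the input dimension, and add a skip connection. Names and shapes are assumptions for illustration, not the paper's code.

```python
import numpy as np

def fuse_block(x, feat_2d, feat_3d, w_fuse):
    """Fuse per-pixel 2D and 3D features back to the input dimension.

    x:        (N, C) block input features
    feat_2d:  (N, C) output of the 2D convolutional branch
    feat_3d:  (N, C) output of the continuous-convolution branch
    w_fuse:   (2C, C) weights of the final 1x1 convolution
    """
    fused = np.concatenate([feat_2d, feat_3d], axis=1)  # (N, 2C)
    out = fused @ w_fuse                                # back to (N, C)
    return out + x                                      # skip connection
```

Because input and output dimensions match, these 2D-3D blocks can be chained, and the residual path keeps gradients flowing during training.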
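
The activation-pattern issue in point 8.2 can be illustrated directly. A ReLU network assigns each input an on/off pattern across its hidden units; counting the distinct patterns realized over sampled inputs shows how many linear regions the network actually uses. The sketch below (my own illustration, not from any cited study) does this for a one-hidden-layer net with scalar input, where each unit flips at one breakpoint, so at most H + 1 patterns are realizable.

```python
import numpy as np

def count_activation_patterns(x, W, b):
    """Count distinct ReLU on/off patterns a one-hidden-layer net
    realizes over a set of scalar inputs.

    x: (M,) sampled inputs; W, b: (H,) hidden-layer weights and biases.
    """
    pre = x[:, None] * W[None, :] + b[None, :]   # (M, H) pre-activations
    on = pre > 0                                  # on/off pattern per input
    return len({tuple(row) for row in on.tolist()})
```

Sampling a dense grid of inputs and comparing the count against H + 1 makes the gap between realized and theoretically possible patterns visible.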