Rich global feature guided network for monocular depth estimation

Authors:


Abstract

Monocular depth estimation is a classical but challenging task in computer vision. In recent years, Convolutional Neural Network (CNN) based models have been developed to estimate high-quality depth maps from a single image, and more recently, Transformer based models have brought further improvements. A central open question is how to handle the global processing of information, which is crucial for inferring depth relations but carries high computational complexity. In this paper, we take advantage of both the Transformer and the CNN and propose a novel network architecture, called the Rich Global Feature Guided Network (RGFN), which extracts rich global features in both the encoder and the decoder. RGFN follows the typical encoder-decoder framework for dense prediction. A hierarchical Transformer serves as the encoder to capture multi-scale contextual information and model long-range dependencies. In the decoder, Large Kernel Convolution Attention (LKCA) extracts global features at different scales and guides the network to progressively recover fine depth maps from low-resolution feature maps. Moreover, we apply a depth-specific data augmentation method, Vertical CutDepth, to boost performance. Experimental results on both indoor and outdoor datasets demonstrate the superiority of RGFN over other state-of-the-art models. Compared with the recent method AdaBins, RGFN improves the RMSE score by 4.66% on the KITTI dataset and 4.67% on the NYU Depth v2 dataset.
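The abstract does not reproduce the LKCA module itself, so as a rough illustration only, the sketch below follows the widely used large-kernel attention decomposition (a depthwise convolution, a dilated depthwise convolution, and a pointwise convolution whose output gates the input), which matches the "large kernel convolution attention" idea the paper names. The class name, channel argument, and all kernel sizes here are assumptions for the sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LKCA(nn.Module):
    """Minimal sketch of a Large Kernel Convolution Attention block.

    Approximates a large (here effectively 23x23) receptive field by
    stacking a 5x5 depthwise conv, a 7x7 depthwise conv with dilation 3,
    and a 1x1 pointwise conv; the result gates the input feature map.
    Kernel sizes are illustrative assumptions, not the paper's values.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.dw_conv = nn.Conv2d(channels, channels, kernel_size=5,
                                 padding=2, groups=channels)
        self.dw_dilated = nn.Conv2d(channels, channels, kernel_size=7,
                                    padding=9, dilation=3, groups=channels)
        self.pw_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute a spatial attention map with a large effective kernel,
        # then apply it multiplicatively to the input features.
        attn = self.pw_conv(self.dw_dilated(self.dw_conv(x)))
        return attn * x
```

Decomposing one large convolution into depthwise, dilated depthwise, and pointwise stages keeps the large receptive field needed for global features while avoiding the quadratic cost of full self-attention, which fits the abstract's stated motivation.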
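Similarly, the Vertical CutDepth augmentation is only named in the abstract. A minimal sketch of the general idea, assuming the common CutDepth-style formulation in which a full-height vertical strip of the ground-truth depth map replaces the corresponding RGB pixels, is given below; the function name and the probability/width parameters are hypothetical.

```python
import numpy as np

def vertical_cutdepth(image: np.ndarray, depth: np.ndarray,
                      p: float = 0.75, max_width: float = 0.25) -> np.ndarray:
    """Sketch of Vertical CutDepth: paste a full-height vertical strip of
    the aligned ground-truth depth map onto the RGB training image.

    image: (H, W, 3) RGB input; depth: (H, W) ground-truth depth.
    p and max_width are illustrative hyperparameters, not the paper's.
    """
    if np.random.rand() > p:
        return image  # apply the augmentation only with probability p
    h, w, _ = image.shape
    strip_w = int(w * np.random.uniform(0.05, max_width))
    x0 = np.random.randint(0, w - strip_w + 1)
    out = image.copy()
    # Replicate the single-channel depth strip across the 3 RGB channels.
    out[:, x0:x0 + strip_w, :] = depth[:, x0:x0 + strip_w, None]
    return out
```

Cutting only along the vertical axis preserves the vertical image geometry, which is a strong depth cue in both indoor and driving scenes; this is the usual rationale given for the vertical variant of CutDepth.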

Keywords: Monocular depth estimation, Transformer, Large kernel convolution attention, Global feature

Article history: Received 14 March 2022, Accepted 14 July 2022, Available online 18 July 2022, Version of Record 23 July 2022.

DOI: https://doi.org/10.1016/j.imavis.2022.104520