Abstract:
Depth estimation is a classical computer vision task that plays a vital role in understanding the geometry of 3D scenes. A key difficulty of monocular depth estimation is extracting long-range contextual dependencies from image features; to address this, an adaptive context aggregation network (ACANet) is proposed. ACANet is built on a supervised self-attention (SSA) module, which adaptively learns task-specific similarities between arbitrary pixels to model continuous context information, and the learned attention weight distribution is used to aggregate and extract image features. First, monocular depth estimation is formulated as a pixel-level multi-class classification problem. Second, an attention loss is designed to reduce the semantic inconsistency between the RGB image and the depth map, and the features at each position are globally pooled with the generated pixel-level attention weights. Finally, a soft ordinal inference (SOI) algorithm is proposed, which fully exploits the network's predicted confidence to transform discrete depth labels into smooth, continuous depth maps, improving accuracy (RMSE decreased by 3%). On NYU Depth V2, the public benchmark for monocular depth estimation, the proposed method achieves an RMSE of 0.490 and a threshold accuracy of 82.8%; these results demonstrate the superiority of the proposed algorithm.
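The abstract names soft ordinal inference (SOI) without giving its formula. A minimal sketch consistent with the description, assuming the common instantiation in which per-pixel softmax confidences over discrete depth classes are averaged against hypothetical depth-bin centers (rather than taking the argmax), might look like:

```python
import numpy as np

def soft_ordinal_inference(logits, depth_bins):
    """Hypothetical SOI sketch: turn per-pixel class scores over K
    discrete depth bins into a smooth continuous depth map.

    logits:     (H, W, K) per-pixel scores for K depth classes
    depth_bins: (K,) assumed depth value (bin center) of each class
    """
    # Numerically stable softmax over the K depth classes
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # Confidence-weighted average of bin centers -> continuous depth,
    # instead of the hard (discrete) argmax label
    return (p * depth_bins).sum(axis=-1)

# Usage: uniform confidence yields the mean of the bin centers,
# while a confident prediction collapses to that bin's depth.
bins = np.array([1.0, 2.0, 4.0])
uniform = soft_ordinal_inference(np.zeros((2, 2, 3)), bins)
```

The choice of bin centers and the discretization scheme (e.g. uniform vs. log-space spacing) are assumptions here; the paper itself would define them.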