Depth estimation based on adaptive pixel-level attention model

陈裕如; 赵海涛

doi:10.5768/JAO202041.0302002

Abstract

Depth estimation is a long-standing computer vision task that plays a vital role in understanding 3D scenes. The main difficulty in monocular depth estimation is extracting long-range contextual dependencies from image features. To address this, an adaptive context aggregation network (ACANet) is proposed. ACANet is built on a supervised self-attention (SSA) model, which adaptively learns task-specific similarities between arbitrary pixels to model continuous context information, and uses the learned attention weight distribution to aggregate the extracted image features. Monocular depth estimation is first formulated as a pixel-level multi-class classification problem. An attention loss function is then designed to reduce the semantic inconsistency between the RGB image and the depth map, and the position-indexed features are globally pooled using the generated pixel-level attention weights. Finally, a soft ordinal inference (SOI) algorithm is proposed, which makes full use of the network's prediction confidence to convert discrete depth labels into a smooth, continuous depth map and improves accuracy (RMSE decreases by 3%). Experiments on NYU Depth V2, a public benchmark dataset for monocular depth estimation, give an RMSE of 0.490 and a threshold accuracy of 82.8%, which demonstrates the advantage of the proposed algorithm.
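The abstract names two mechanisms without giving their formulas: pooling features with pixel-level attention weights, and turning per-pixel class confidences into a continuous depth value. The Python sketch below illustrates one plausible reading of each and should not be taken as the authors' implementation. All function names are hypothetical. The softmax normalisation of attention scores, the log-spaced depth bins, and the expectation-based decoding are assumptions chosen for illustration; the paper's SOI algorithm may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, logits):
    """Aggregate features for every pixel i as sum_j w[i][j] * features[j].

    features: N feature vectors (one per pixel position), each of length C.
    logits:   N x N raw similarity scores; row i is normalised with softmax
              (assumed normalisation, not taken from the source).
    """
    channels = len(features[0])
    pooled = []
    for row in logits:
        weights = softmax(row)
        pooled.append([
            sum(w * f[c] for w, f in zip(weights, features))
            for c in range(channels)
        ])
    return pooled


def log_spaced_bins(d_min, d_max, k):
    """Centres of k depth bins spaced uniformly in log-depth (an assumption)."""
    step = (math.log(d_max) - math.log(d_min)) / k
    return [math.exp(math.log(d_min) + (i + 0.5) * step) for i in range(k)]


def hard_decode(probs, centres):
    """Depth of the single most probable bin (discrete output)."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return centres[best]


def soft_decode(probs, centres):
    """Confidence-weighted depth: expectation of log-depth under probs."""
    log_depth = sum(p * math.log(c) for p, c in zip(probs, centres))
    return math.exp(log_depth / sum(probs))


if __name__ == "__main__":
    # Illustrative depth range and class scores, not values from the paper.
    centres = log_spaced_bins(1.0, 10.0, 4)
    probs = softmax([0.1, 2.0, 1.9, 0.0])
    print(hard_decode(probs, centres), soft_decode(probs, centres))
```

With hard decoding, two neighbouring pixels whose top class differs jump by a whole bin. The soft variant moves gradually as the confidences shift, which is the kind of smoothing the abstract attributes to SOI.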
