CHEN Yuru, ZHAO Haitao. Depth estimation based on adaptive pixel-level attention model[J]. Journal of Applied Optics, 2020, 41(3): 490-499. DOI: 10.5768/JAO202041.0302002

Depth estimation based on adaptive pixel-level attention model

More Information
  • Received Date: September 01, 2019
  • Revised Date: December 29, 2019
  • Available Online: May 29, 2020
  • Depth estimation is a classic computer vision task that plays a vital role in understanding the geometry of 3D scenes. The central difficulty of monocular depth estimation is extracting long-range contextual dependencies from image features, and an adaptive context aggregation network (ACANet) was proposed to address this problem. ACANet is based on a supervised self-attention (SSA) model, which adaptively learns task-specific similarities between arbitrary pixels to model continuous contextual information; the learned attention weight distribution is then used to aggregate and extract image features. First, monocular depth estimation was formulated as a pixel-level multi-class classification problem. Then, an attention loss function was designed to reduce the semantic inconsistency between the RGB image and the depth map, and position-indexed features were globally pooled by the generated pixel-level attention weights. Finally, a soft ordinal inference (SOI) algorithm was proposed that fully exploits the network's predicted confidences to transform discrete depth labels into smooth, continuous depth maps, improving accuracy (RMSE reduced by 3%). On NYU Depth V2, the public benchmark for monocular depth estimation, the proposed method achieves an RMSE of 0.490 and a threshold accuracy of 82.8%, outperforming the compared methods and demonstrating the superiority of the proposed algorithm.
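  The soft ordinal inference step described above can be sketched as follows: the network's per-pixel classification scores over K discrete depth bins are converted to confidences with a softmax, and the continuous depth is recovered as the confidence-weighted sum of the bin centers rather than a hard argmax. This is a minimal illustration of the idea, not the paper's implementation; the function name, array shapes, and bin spacing are assumptions for the example.

```python
import numpy as np

def soft_ordinal_inference(logits, bin_centers):
    """Turn per-pixel scores over K discrete depth bins into a
    continuous depth map via a confidence-weighted sum.

    logits      : (H, W, K) raw per-bin network outputs (assumed shape)
    bin_centers : (K,) depth value representing each discrete bin
    """
    # Softmax over the bin dimension gives per-pixel confidences.
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # Expected depth: soft-weighted sum over the bin centers,
    # yielding a smooth value instead of a discrete label.
    return (p * bin_centers).sum(axis=-1)

# Toy usage: a 2x2 image with 4 depth bins spaced log-uniformly
# in [0.5, 10] m (bin layout is an illustrative assumption).
bins = np.geomspace(0.5, 10.0, 4)
logits = np.zeros((2, 2, 4))          # uniform scores at every pixel
depth = soft_ordinal_inference(logits, bins)
```

With uniform scores the confidences are equal, so every pixel's predicted depth is simply the mean of the bin centers; with peaked scores the prediction interpolates smoothly between neighboring bins, which is what lets SOI produce continuous depth maps from discrete labels.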
  • [1]
    SILBERMAN N, HOIEM D, KOHLI P, et al.Indoor segmentation and support inference from RGBD images[C]//Comput. Vis -ECCV 2012. Berlin: Springer, 2012: 746-760.
    [2]
    郭连朋, 陈向宁, 刘彬, 等. 基于Kinect传感器多深度图像融合的物体三维重建[J]. 应用光学,2014,35(5):811-816.

    GUO Lianpeng, CHEN Xiangning, LIU Bin, et al. 3D-object reconstruction based on fusion of depth images by Kinect sensor[J]. Journal of Applied Optics,2014,35(5):811-816.
    [3]
    SIMON M, MILZ S, AMENDE K, et al. Complex-YOLO: an euler-region-proposal for real-time 3D object detection on point clouds[M]. Cham: Springer International Publishing, 2018: 197-209.
    [4]
    LAINA I, RUPPRECHT C, BELAGIANNIS V, et al. Deeper depth prediction with fully convolutional residual networks[C]//2016 Fourth International Conference on 3D Vision. Stanford, CA: IEEE, 2016: 239-248.
    [5]
    裴嘉欣, 孙韶媛, 王宇岚, 等. 基于改进 YOLOv3 网络的无人车夜间环境感知[J]. 应用光学,2019,40(3):380-386. doi: 10.5768/JAO201940.0301004

    PEI Jiaxin, SUN Shaoyuan, WANG Yulan, et al. Nighttime environment perception of driverless vehicles based on improved YOLOv3 network[J]. Journal of Applied Optics,2019,40(3):380-386. doi: 10.5768/JAO201940.0301004
    [6]
    EIGEN D, PUHRSCH C, FERGUS R. Depth map prediction from a single image using a multi-scale deep network[C]//International Conference on Neural Information Processing Systems. USA: arXiv, 2014.
    [7]
    EIGEN D, FERGUS R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture[C]//Proceedings of the IEEE International Conference on Computer Vision(ICCV). USA: IEEE, 2015.
    [8]
    吴寿川, 赵海涛, 孙韶媛. 基于双向递归卷积神经网络的单目红外视频深度估计[J]. 光学学报,2019,37(12):246-254.

    WU Shouchuan, ZHAO Haitao, SUN Shaoyuan. Depth estimation from monocular infrared video based on Bi-recursive convolutional neural network[J]. Acta Optica Sinica,2019,37(12):246-254.
    [9]
    GARG R, VIJAY K B G, CARNEIRO G, et al. Unsupervised cnn for single view depth estimation: geometry to the rescue[C]//European Conference on Computer Vision. Cham: Springer, 2016.
    [10]
    CLÉMENT G, AODHA O M, BROSTOW G J.Unsupervised monocular depth estimation with left-right consistency[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). USA: IEEE, 2017.
    [11]
    顾婷婷, 赵海涛, 孙韶媛. 基于金字塔型残差神经网络的红外图像深度估计[J]. 红外技术,2018,40(5):21-27.

    GU Tingting, ZHAO Haitao, SUN Shaoyuan. Depth estimation of infrared image based on pyramid residual neural networks[J]. Infrared Technology,2018,40(5):21-27.
    [12]
    RONNEBERGER O, FISCHER P, BROX T. U-net: Convolutional networks for biomedical image segmentation[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer, 2015: 234-241.
    [13]
    HUANG Jinggang, LEE A B, MUMFORD D. Statistics of range images[C]//Computer Vision and Pattern Recognition. USA: IEEE, 2000: 324-331.
    [14]
    CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation[J]. arXiv, 2017: 1706.05587.
    [15]
    YU F, KOLTUN V. Multi-scale context aggregation by dilated convolutions[J]. arXiv, 2015: 1511.07122.
    [16]
    WANG Panqu, CHEN Pengfei, YUAN Ye, et al. Understanding convolution for semantic segmentation[C]//2018 IEEE Winter Conference on Applications of Computer Vision (WACV). USA: IEEE, 2018.
    [17]
    HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE, 2016.
    [18]
    NIU Zhenxing, ZHOU Mo, WANG Le, et al. Ordinal regression with multiple output cnn for age estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE, 2016.
    [19]
    FU Huan, GONG Mingming, WANG Chaohui, et al. Deep ordinal regression network for monocular depth estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE, 2018.
    [20]
    VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. USA: NIPS Foundation, Inc., 2017.
    [21]
    WANG Xiaolong, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE, 2018.
    [22]
    LI Bo, DAI Yuchao, HE Mingyi. Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference[J]. Pattern Recognition,2018,83:328-339. doi: 10.1016/j.patcog.2018.05.029
    [23]
    CAO Y Z H, WU Z, SHEN C. Estimating depth from monocular images as classification using deep fully convolutional residual networks[C]//IEEE Transactions on Circuits and Systems for Video Technology. USA: IEEE, 2017.
    [24]
    JIA Deng, WEI Dong, SOCHER R, et al. ImageNet: A large-scale hierarchical image database[C]//IEEE Computer Vision & Pattern Recognition.USA: IEEE, 2009: 248-255.
    [25]
    XU Dan, RICCI E, OUYANG Wanli, et al. Multi-scale continuous CRFs as sequential deep networks for monocular depth Estimation[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). USA: IEEE Computer Society, 2017.
