DPDFormer: A Coarse-to-Fine Model for Monocular Depth Estimation

Published: 22 January 2024

Abstract

Monocular depth estimation (MDE) attracts considerable attention from computer vision researchers because it offers a convenient way to acquire depth information about the environment. Recently, classification-based MDE methods have shown promising performance and have begun to play an essential role in many multi-view applications such as reconstruction and 3D object detection. However, existing classification-based MDE models usually apply a fixed depth range discretization strategy across the whole scene. This fixed discretization leads to an imbalance of discretization scale among different depth ranges, resulting in inexact depth range localization. In this article, to alleviate the imbalanced depth range discretization problem in classification-based MDE, we follow the coarse-to-fine principle and propose a novel depth range discretization method called depth post-discretization (DPD). Based on a coarse depth anchor that roughly indicates the depth range, DPD generates the depth range discretization adaptively for every position. The resulting discretization is more fine-grained around the actual depth, which helps locate the depth range more precisely at each scene position. In addition, to better handle the prediction of the coarse depth anchor and of the depth probability distribution used to compute the final depth, we design a dual-decoder transformer-based network, DPDFormer, which is tailored to the proposed DPD method. We evaluate DPDFormer on the popular depth datasets NYU Depth V2 and KITTI, and the experimental results demonstrate the superior performance of our method.
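
To make the coarse-to-fine discretization concrete, the PyTorch sketch below illustrates one plausible way to build per-pixel depth bins that are densest around a coarse anchor and grow coarser toward the scene depth limits, and to read out the final depth as the expectation over those bins. The function names, the power-law warping, and all hyperparameters (num_bins, gamma, depth limits) are illustrative assumptions, not the exact formulation used in the paper.

```python
import torch

def post_discretize_depth(anchor, num_bins=64, d_min=1e-3, d_max=10.0, gamma=2.0):
    """Build per-pixel depth bin centers around a coarse anchor (hypothetical sketch).

    anchor: (B, 1, H, W) coarse depth prediction, assumed to lie in [d_min, d_max].
    Returns bins of shape (B, num_bins, H, W), densest near the anchor.
    """
    # Signed offsets in [-1, 1], shared across all pixels.
    t = torch.linspace(-1.0, 1.0, num_bins, device=anchor.device, dtype=anchor.dtype)
    # Power-law warp keeps small offsets densely packed near zero (the anchor).
    warped = torch.sign(t) * t.abs() ** gamma             # (num_bins,)
    warped = warped.view(1, num_bins, 1, 1)
    # Scale offsets by the headroom toward the nearer range limit on each side,
    # so every bin stays inside [d_min, d_max] wherever the anchor falls.
    up_room = d_max - anchor                              # (B, 1, H, W)
    down_room = anchor - d_min
    room = torch.where(warped >= 0, up_room, down_room)   # (B, num_bins, H, W)
    return anchor + warped * room                         # (B, num_bins, H, W)

def expected_depth(prob, bins):
    """Final depth as the expectation of bin centers under the predicted
    per-pixel probability distribution (soft classification)."""
    return (prob * bins).sum(dim=1, keepdim=True)         # (B, 1, H, W)

# Minimal usage example with random tensors standing in for network outputs.
if __name__ == "__main__":
    B, H, W, K = 2, 12, 16, 64
    anchor = torch.rand(B, 1, H, W) * 9.0 + 0.5           # coarse depths in [0.5, 9.5]
    bins = post_discretize_depth(anchor, num_bins=K)
    logits = torch.randn(B, K, H, W)
    prob = logits.softmax(dim=1)                          # per-pixel distribution over bins
    depth = expected_depth(prob, bins)
    print(depth.shape)                                    # torch.Size([2, 1, 12, 16])
```

In this sketch, the offsets are scaled by the remaining headroom toward d_min or d_max on each side of the anchor, so the bins adapt to each pixel's coarse estimate while always covering a valid depth interval.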


Information

    Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 5
May 2024, 650 pages
EISSN: 1551-6865
DOI: 10.1145/3613634
Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 January 2024
    Online AM: 25 December 2023
    Accepted: 17 December 2023
    Revised: 22 September 2023
    Received: 05 June 2023
    Published in TOMM Volume 20, Issue 5


    Author Tags

    1. Monocular depth estimation
    2. coarse-to-fine
    3. depth post-discretization
4. DPDFormer

    Qualifiers

    • Research-article

    Funding Sources

    • Natural Science Foundation of China
    • HIT Assistant Professor Research Initiation Program
