DPDFormer: A Coarse-to-Fine Model for Monocular Depth Estimation

Published: 22 January 2024

Abstract

Monocular depth estimation (MDE) attracts considerable attention from computer vision researchers because it offers a convenient way to acquire depth information about the environment. Recently, classification-based MDE methods have shown promising performance and have begun to play an essential role in many multi-view applications such as reconstruction and 3D object detection. However, existing classification-based MDE models usually apply a fixed depth range discretization strategy across the whole scene. This fixed discretization leads to an imbalance of discretization scale among different depth ranges, resulting in inexact depth range localization. In this article, to alleviate the imbalanced depth range discretization problem in classification-based MDE, we follow the coarse-to-fine principle and propose a novel depth range discretization method called depth post-discretization (DPD). Based on a coarse depth anchor that roughly indicates the depth range, DPD generates the depth range discretization adaptively for every position. The resulting discretization is more fine-grained around the actual depth, which helps locate the depth range more precisely at each scene position. In addition, to better handle the prediction of the coarse depth anchor and of the depth probability distribution used to compute the final depth, we design a dual-decoder transformer-based network, DPDFormer, which is tailored to the proposed DPD method. We evaluate DPDFormer on the popular depth datasets NYU Depth V2 and KITTI, and the experimental results demonstrate the superior performance of our method.
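
To make the coarse-to-fine discretization concrete, the PyTorch sketch below illustrates one plausible way to build per-pixel depth bins that are densest around a coarse anchor and grow coarser toward the scene depth limits, and to read out the final depth as the expectation over those bins. The function names, the power-law warping, and all hyperparameters (num_bins, gamma, depth limits) are illustrative assumptions, not the exact formulation used in the paper.

```python
import torch

def post_discretize_depth(anchor, num_bins=64, d_min=1e-3, d_max=10.0, gamma=2.0):
    """Build per-pixel depth bin centers around a coarse anchor (hypothetical sketch).

    anchor: (B, 1, H, W) coarse depth prediction, assumed to lie in [d_min, d_max].
    Returns bins of shape (B, num_bins, H, W), densest near the anchor.
    """
    # Signed offsets in [-1, 1], shared across all pixels.
    t = torch.linspace(-1.0, 1.0, num_bins, device=anchor.device, dtype=anchor.dtype)
    # Power-law warp keeps small offsets densely packed near zero (the anchor).
    warped = torch.sign(t) * t.abs() ** gamma             # (num_bins,)
    warped = warped.view(1, num_bins, 1, 1)
    # Scale offsets by the headroom toward the nearer range limit on each side,
    # so every bin stays inside [d_min, d_max] wherever the anchor falls.
    up_room = d_max - anchor                              # (B, 1, H, W)
    down_room = anchor - d_min
    room = torch.where(warped >= 0, up_room, down_room)   # (B, num_bins, H, W)
    return anchor + warped * room                         # (B, num_bins, H, W)

def expected_depth(prob, bins):
    """Final depth as the expectation of bin centers under the predicted
    per-pixel probability distribution (soft classification)."""
    return (prob * bins).sum(dim=1, keepdim=True)         # (B, 1, H, W)

# Minimal usage example with random tensors standing in for network outputs.
if __name__ == "__main__":
    B, H, W, K = 2, 12, 16, 64
    anchor = torch.rand(B, 1, H, W) * 9.0 + 0.5           # coarse depths in [0.5, 9.5]
    bins = post_discretize_depth(anchor, num_bins=K)
    logits = torch.randn(B, K, H, W)
    prob = logits.softmax(dim=1)                          # per-pixel distribution over bins
    depth = expected_depth(prob, bins)
    print(depth.shape)                                    # torch.Size([2, 1, 12, 16])
```

In this sketch, the offsets are scaled by the remaining headroom toward d_min or d_max on each side of the anchor, so the bins adapt to each pixel's coarse estimate while always covering a valid depth interval.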


Information

    Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 5
May 2024, 650 pages
EISSN: 1551-6865
DOI: 10.1145/3613634
Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 January 2024
    Online AM: 25 December 2023
    Accepted: 17 December 2023
    Revised: 22 September 2023
    Received: 05 June 2023
    Published in TOMM Volume 20, Issue 5


    Author Tags

    1. Monocular depth estimation
    2. coarse-to-fine
    3. depth post-discretization
4. DPDFormer

    Qualifiers

    • Research-article

    Funding Sources

    • Natural Science Foundation of China
    • HIT Assistant Professor Research Initiation Program
