DOI: 10.1145/3503161.3547828
research-article

CycleHand: Increasing 3D Pose Estimation Ability on In-the-wild Monocular Image through Cyclic Flow

Published: 10 October 2022

Abstract

Current methods for 3D hand pose estimation generalize poorly to new in-the-wild scenarios because of varying camera viewpoints, self-occlusions, and complex environments. To address this problem, we propose CycleHand, which improves the generalization ability of the model in a self-supervised manner. Our motivation rests on one observation: if the whole hand is rotated globally and then rotated back, the estimated 3D finger poses should remain consistent before and after the rotation, because wrist-relative hand poses are unchanged by a global 3D rotation. Hence, we propose arbitrary-rotation self-supervised consistency learning to improve the model's robustness to varying viewpoints. CycleHand also introduces a high-fidelity texture map for rendering photorealistic rotated hands under different lighting conditions, backgrounds, and skin tones, which further strengthens the self-supervised task. To reduce the potential negative effects of the domain shift introduced by synthetic images, we use contrastive learning to train a synthetic-real consistent feature extractor that extracts domain-irrelevant hand representations. Experiments show that CycleHand substantially improves hand pose estimation performance on both canonical datasets and real-world applications.
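
The rotation observation in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`random_rotation`, `wrist_relative`, `cycle_consistency_loss`), the 21-joint layout with joint 0 as the wrist, and the squared-error loss are all assumptions chosen for illustration. The sketch only shows that undoing a known global rotation should recover the original wrist-relative pose, so any residual can serve as a self-supervised training signal.

```python
import numpy as np


def random_rotation(rng):
    """Draw a random proper 3D rotation matrix (hypothetical helper)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix the sign ambiguity of the QR factors
    if np.linalg.det(q) < 0:         # force det = +1 (rotation, not reflection)
        q[:, 0] = -q[:, 0]
    return q


def wrist_relative(joints):
    """Express (J, 3) joints relative to the wrist, assumed to be joint 0."""
    return joints - joints[0]


def cycle_consistency_loss(pose_original, pose_rotated, rotation):
    """Mean squared joint distance after undoing the known global rotation.

    pose_original: (J, 3) wrist-relative pose estimated on the original view.
    pose_rotated:  (J, 3) wrist-relative pose estimated on the rotated view.
    rotation:      (3, 3) matrix R that produced the rotated view (p' = R p).
    """
    # Joints are stored as rows, so applying R^T to each joint is `@ rotation`.
    pose_back = pose_rotated @ rotation
    return float(np.mean(np.sum((pose_back - pose_original) ** 2, axis=1)))


rng = np.random.default_rng(0)
joints = rng.normal(size=(21, 3))                 # assumed 21-joint hand
R = random_rotation(rng)
rotated = joints @ R.T + np.array([0.1, -0.2, 0.3])  # rotate, then translate

# A perfect estimator would yield a loss of (numerically) zero.
loss = cycle_consistency_loss(wrist_relative(joints), wrist_relative(rotated), R)
```

In a training loop, `pose_original` and `pose_rotated` would come from the pose network applied to the real image and to the rendered rotated image respectively; here the ground-truth joints stand in for both predictions.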

Supplementary Material

MP4 File (MM22-358.mp4)
CycleHand aims to enhance current 3D hand pose estimation networks by improving their in-the-wild performance. The core idea is simple: include rendered hard-view images (hand crops taken from severe viewpoints) in the training process. We use neural rendering to achieve this. Moreover, to avoid the mesh penetration problem, we introduce novel mechanical constraints. We hope you enjoy our video!

Cited By

  • (2023) CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting. Proceedings of the 31st ACM International Conference on Multimedia, 4896-4907. DOI: 10.1145/3581783.3612390. Online publication date: 26-Oct-2023
  • (2023) Hand Avatar: Free-Pose Hand Animation and Rendering from Monocular Video. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8683-8693. DOI: 10.1109/CVPR52729.2023.00839. Online publication date: Jun-2023

    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN:9781450392037
    DOI:10.1145/3503161

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. 3D pose estimation
    2. domain adaption
    3. hand
    4. texture

    Qualifiers

    • Research-article

    Conference

    MM '22

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%

