DOI: 10.1145/3539618.3592069
Short paper
Open access

Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language

Published: 18 July 2023

Abstract

Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available here: https://github.com/mesnico/text-to-motion-retrieval.
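
To make the retrieval setup concrete, the sketch below illustrates the kind of dual-encoder training objective the paper experiments with: text and motion embeddings produced by separate encoders are compared in a shared space, and a symmetric InfoNCE-style contrastive loss, one widely adopted metric-learning objective, pulls matching pairs together. The helper names (`symmetric_info_nce`, `rank_motions`) and hyperparameters are illustrative assumptions rather than the authors' implementation; the linked repository contains the actual code.

```python
import torch
import torch.nn.functional as F


def symmetric_info_nce(text_emb, motion_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (text, motion) embeddings.

    text_emb, motion_emb: (B, D) tensors where matching pairs share the same
    row index. This is one commonly adopted metric-learning objective for
    cross-modal retrieval, not necessarily the exact loss used in the paper.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = text_emb @ motion_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2m = F.cross_entropy(logits, targets)           # text -> motion direction
    loss_m2t = F.cross_entropy(logits.t(), targets)       # motion -> text direction
    return 0.5 * (loss_t2m + loss_m2t)


def rank_motions(query_emb, motion_bank):
    """Rank a bank of motion embeddings (N, D) by cosine similarity to a text query (D,)."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(motion_bank, dim=-1).t()
    return sims.argsort(descending=True)
```

At retrieval time, the motion embeddings can be precomputed and indexed, so answering a text query reduces to a nearest-neighbour search in the shared embedding space.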

Supplemental Material

MP4 File
Pose-estimation methods can extract human motion from a common video in the form of structured skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal motion data remains a challenging problem. Inspired by the success of cross-modal approaches in the image and video domains, we propose the new task of text-to-motion retrieval, which aims at identifying the motions most relevant to a text query specified in natural language. We also propose a new Motion Transformer encoder network that outperforms baseline approaches on the KIT Motion-Language and HumanML3D text-motion datasets.
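
The following is a minimal PyTorch sketch of divided space-time self-attention applied to skeleton-joint tokens, the mechanism the Motion Transformer encoder is described as using: each block first attends across the joints within a frame, then across the frames of each joint. The `DividedSpaceTimeBlock` module, its head count, and its MLP sizes are illustrative assumptions, not the authors' exact MoT architecture.

```python
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    """Transformer block with divided space-time attention over joint tokens.

    Expects input of shape (B, T, J, D): batch, frames, skeleton joints,
    feature dimension. Spatial attention mixes the J joints within each frame;
    temporal attention mixes the T frames of each joint.
    """

    def __init__(self, dim, heads=4):
        super().__init__()
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_space = nn.LayerNorm(dim)
        self.norm_time = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        B, T, J, D = x.shape
        # Spatial attention: attend across joints, independently for each frame.
        s = x.reshape(B * T, J, D)
        sn = self.norm_space(s)
        s = s + self.space_attn(sn, sn, sn)[0]
        x = s.reshape(B, T, J, D)
        # Temporal attention: attend across frames, independently for each joint.
        t = x.permute(0, 2, 1, 3).reshape(B * J, T, D)
        tn = self.norm_time(t)
        t = t + self.time_attn(tn, tn, tn)[0]
        x = t.reshape(B, J, T, D).permute(0, 2, 1, 3)
        # Position-wise feed-forward network.
        return x + self.mlp(self.norm_mlp(x))
```

Factorising attention over the two axes keeps the per-block cost at roughly O(T·J² + J·T²) instead of the O((T·J)²) required by full attention over all joint-time tokens, which matters for long motion sequences.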

Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023
3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

  • Honorable Mention Short Paper

Author Tags

  1. BERT
  2. CLIP
  3. ViViT
  4. cross-modal retrieval
  5. deep language models
  6. human motion data
  7. motion retrieval
  8. skeleton sequences

Qualifiers

  • Short-paper

Funding Sources

  • AI4Media - A European Excellence Centre for Media, Society, and Democracy
  • SUN - Social and hUman ceNtered XR
  • ERDF CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence

Conference

SIGIR '23

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
