DOI: 10.1145/3539618.3592069
Short paper
Open access

Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language

Published: 18 July 2023

Abstract

Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available here: https://github.com/mesnico/text-to-motion-retrieval.
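
To make the retrieval setup concrete, the sketch below illustrates the kind of dual-encoder training objective the paper experiments with: text and motion embeddings produced by separate encoders are compared in a shared space, and a symmetric InfoNCE-style contrastive loss, one widely adopted metric-learning objective, pulls matching pairs together. The helper names (`symmetric_info_nce`, `rank_motions`) and hyperparameters are illustrative assumptions rather than the authors' implementation; the linked repository contains the actual code.

```python
import torch
import torch.nn.functional as F


def symmetric_info_nce(text_emb, motion_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (text, motion) embeddings.

    text_emb, motion_emb: (B, D) tensors where matching pairs share the same
    row index. This is one commonly adopted metric-learning objective for
    cross-modal retrieval, not necessarily the exact loss used in the paper.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = text_emb @ motion_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2m = F.cross_entropy(logits, targets)           # text -> motion direction
    loss_m2t = F.cross_entropy(logits.t(), targets)       # motion -> text direction
    return 0.5 * (loss_t2m + loss_m2t)


def rank_motions(query_emb, motion_bank):
    """Rank a bank of motion embeddings (N, D) by cosine similarity to a text query (D,)."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(motion_bank, dim=-1).t()
    return sims.argsort(descending=True)
```

At retrieval time, the motion embeddings can be precomputed and indexed, so answering a text query reduces to a nearest-neighbour search in the shared embedding space.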

Supplemental Material

MP4 File
Pose-estimation methods can extract human motion from a common video in the form of structured skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal motion data remains a challenging problem. Inspired by the success of cross-modal approaches in the image and video domains, we propose the new task of text-to-motion retrieval, which aims at identifying the motions most relevant to a text query specified in natural language. We also propose a new Motion Transformer encoder network that outperforms baseline approaches on the KIT Motion-Language and HumanML3D text-motion datasets.
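
The following is a minimal PyTorch sketch of divided space-time self-attention applied to skeleton-joint tokens, the mechanism the Motion Transformer encoder is described as using: each block first attends across the joints within a frame, then across the frames of each joint. The `DividedSpaceTimeBlock` module, its head count, and its MLP sizes are illustrative assumptions, not the authors' exact MoT architecture.

```python
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    """Transformer block with divided space-time attention over joint tokens.

    Expects input of shape (B, T, J, D): batch, frames, skeleton joints,
    feature dimension. Spatial attention mixes the J joints within each frame;
    temporal attention mixes the T frames of each joint.
    """

    def __init__(self, dim, heads=4):
        super().__init__()
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_space = nn.LayerNorm(dim)
        self.norm_time = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        B, T, J, D = x.shape
        # Spatial attention: attend across joints, independently for each frame.
        s = x.reshape(B * T, J, D)
        sn = self.norm_space(s)
        s = s + self.space_attn(sn, sn, sn)[0]
        x = s.reshape(B, T, J, D)
        # Temporal attention: attend across frames, independently for each joint.
        t = x.permute(0, 2, 1, 3).reshape(B * J, T, D)
        tn = self.norm_time(t)
        t = t + self.time_attn(tn, tn, tn)[0]
        x = t.reshape(B, J, T, D).permute(0, 2, 1, 3)
        # Position-wise feed-forward network.
        return x + self.mlp(self.norm_mlp(x))
```

Factorising attention over the two axes keeps the per-block cost at roughly O(T·J² + J·T²) instead of the O((T·J)²) required by full attention over all joint-time tokens, which matters for long motion sequences.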

Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023
3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

  • Honorable Mention Short Paper

Author Tags

  1. BERT
  2. CLIP
  3. ViViT
  4. cross-modal retrieval
  5. deep language models
  6. human motion data
  7. motion retrieval
  8. skeleton sequences

Qualifiers

  • Short-paper

Funding Sources

  • AI4Media - A European Excellence Centre for Media, Society, and Democracy
  • SUN - Social and hUman ceNtered XR
  • ERDF CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence

Conference

SIGIR '23

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
