DOI: 10.1145/3477495.3531776
Short paper

Animating Images to Transfer CLIP for Video-Text Retrieval

Published: 07 July 2022

Abstract

Recent works have shown that the CLIP (Contrastive Language-Image Pretraining) model can be transferred to video-text retrieval with promising performance. However, due to the domain gap between static images and videos, CLIP-based video-text retrieval models with interaction-based matching perform far worse than models with representation-based matching. In this paper, we propose a novel image animation strategy to transfer the image-text CLIP model to video-text retrieval effectively. By imitating the components of video shooting, we convert widely used image-language corpora into synthesized video-text data for pretraining. To reduce the time complexity of interaction matching, we further propose a coarse-to-fine framework that consists of dual encoders for fast candidate search and a cross-modality interaction module for fine-grained re-ranking. Together with the synthesized video-text pretraining, the coarse-to-fine framework provides significant gains in retrieval accuracy while preserving efficiency. Comprehensive experiments on the MSR-VTT, MSVD, and VATEX datasets demonstrate the effectiveness of our approach.
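To make the two ingredients above concrete, the following minimal sketch (an illustration under stated assumptions, not the authors' implementation) animates a static image into a pseudo-video clip by simulating a simple camera pan, and runs coarse-to-fine retrieval in which cheap dual-encoder similarity shortlists candidates before a more expensive interaction score re-ranks them. The crop-based pan, the placeholder interaction scorer, and all function names are hypothetical.

import numpy as np

def animate_image(image, num_frames=8):
    # Turn a static H x W x 3 image into a pseudo-video clip by sliding a
    # crop window across it, mimicking a camera pan (assumed camera motion).
    h, w, _ = image.shape
    ch, cw = int(h * 0.8), int(w * 0.8)
    frames = []
    for t in range(num_frames):
        y = int((h - ch) * t / max(num_frames - 1, 1))
        x = int((w - cw) * t / max(num_frames - 1, 1))
        frames.append(image[y:y + ch, x:x + cw])
    return np.stack(frames)  # (num_frames, ch, cw, 3)

def coarse_to_fine_retrieval(text_emb, video_embs, interaction_score, top_k=10):
    # Coarse stage: one cosine similarity per video over the whole corpus.
    sims = video_embs @ text_emb / (
        np.linalg.norm(video_embs, axis=1) * np.linalg.norm(text_emb) + 1e-8)
    candidates = np.argsort(-sims)[:top_k]
    # Fine stage: the costly cross-modal interaction runs only on the shortlist.
    fine = np.array([interaction_score(text_emb, video_embs[i]) for i in candidates])
    return candidates[np.argsort(-fine)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip_frames = animate_image(rng.random((224, 224, 3)))
    ranking = coarse_to_fine_retrieval(
        text_emb=rng.random(512),
        video_embs=rng.random((1000, 512)),
        interaction_score=lambda t, v: float(t @ v),  # placeholder scorer
    )
    print(clip_frames.shape, ranking[:3])

The two-stage structure is what keeps interaction-based matching affordable: the dual encoders score the full corpus cheaply, and the interaction module only ever sees the top-k shortlist.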

Supplementary Material

MP4 File (SIGIR22-fp12345.mp4)




    Published In

    SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2022
    3569 pages
    ISBN:9781450387323
    DOI:10.1145/3477495


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. clip
    2. coarse to fine retrieval
    3. image animation
    4. video-text retrieval

    Qualifiers

    • Short-paper

    Conference

    SIGIR '22

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%


    Article Metrics

    • Downloads (Last 12 months): 114
    • Downloads (Last 6 weeks): 8
    Reflects downloads up to 14 Sep 2024


    Cited By

    • (2024) Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 240-250. DOI: 10.1145/3626772.3657831. Online publication date: 10-Jul-2024.
    • (2024) Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16551-16560. DOI: 10.1109/CVPR52733.2024.01566. Online publication date: 16-Jun-2024.
    • (2023) Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2420-2425. DOI: 10.1145/3539618.3592069. Online publication date: 19-Jul-2023.
    • (2023) Adapting Generative Pretrained Language Model for Open-domain Multimodal Sentence Summarization. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 195-204. DOI: 10.1145/3539618.3591633. Online publication date: 19-Jul-2023.
    • (2023) Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 12020-12030. DOI: 10.1109/ICCV51070.2023.01107. Online publication date: 1-Oct-2023.
    • (2023) Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 11268-11278. DOI: 10.1109/ICCV51070.2023.01038. Online publication date: 1-Oct-2023.
    • (2023) Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18962-18972. DOI: 10.1109/CVPR52729.2023.01818. Online publication date: Jun-2023.
