VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models

Wang, Jiapeng; Wang, Chengyu; Huang, Kunzhe; Huang, Jun; Jin, Lianwen

arXiv:2410.00741 (cs)

[Submitted on 1 Oct 2024 (v1), last revised 4 Oct 2024 (this version, v2)]

Abstract: Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in numerous applications. However, its emphasis on brief summary texts during pre-training prevents CLIP from understanding long descriptions. This issue is particularly acute for videos, which often contain abundant detailed content. In this paper, we propose the VideoCLIP-XL (eXtra Length) model, which aims to unleash the long-description understanding capability of video CLIP models. First, we establish an automatic data collection system and gather a large-scale VILD pre-training dataset of VIdeo and Long-Description pairs. Then, we propose Text-similarity-guided Primary Component Matching (TPCM) to better learn the distribution of the feature space while expanding the long-description capability. We also introduce two new tasks, Detail-aware Description Ranking (DDR) and Hallucination-aware Description Ranking (HDR), to further improve understanding. Finally, we construct a Long Video Description Ranking (LVDR) benchmark to evaluate the long-description capability more comprehensively. Extensive experiments on widely used text-video retrieval benchmarks with both short and long descriptions, and on our LVDR benchmark, demonstrate the effectiveness of our method.
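
As a rough illustration of what a description-ranking check might look like, the sketch below scores several descriptions against one video embedding and tests whether the similarities fall in the expected order (most faithful description first). This is only a minimal sketch under stated assumptions: the abstract does not specify how DDR, HDR, or the LVDR metric are actually formulated, and the function names `cosine` and `ranking_is_consistent`, as well as the toy vectors, are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_is_consistent(video_emb, text_embs):
    """Check that similarity to the video strictly decreases along text_embs.

    Assumption (not from the paper): text_embs is ordered from the most
    faithful description to the least faithful one, e.g. with progressively
    fewer details or more hallucinated content.
    """
    scores = [cosine(video_emb, t) for t in text_embs]
    ok = all(s1 > s2 for s1, s2 in zip(scores, scores[1:]))
    return ok, scores


# Toy embeddings standing in for encoder outputs (hypothetical values).
video = [1.0, 0.0]
descriptions = [[1.0, 0.0], [1.0, 0.5], [1.0, 1.0]]
ok, scores = ranking_is_consistent(video, descriptions)
print(ok, [round(s, 3) for s in scores])
```

In a real setting the vectors would come from the video and text encoders of a CLIP-style model; the strict-ordering criterion used here is one plausible choice among several.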

Comments:	EMNLP 2024 Main conference
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2410.00741 [cs.CL]
	(or arXiv:2410.00741v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2410.00741

Submission history

From: Jiapeng Wang
[v1] Tue, 1 Oct 2024 14:33:22 UTC (2,427 KB)
[v2] Fri, 4 Oct 2024 16:10:38 UTC (2,208 KB)
