VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge

Ravi, Sahithya; Chinchure, Aditya; Sigal, Leonid; Liao, Renjie; Shwartz, Vered

arXiv:2210.13626 (cs)

[Submitted on 24 Oct 2022]

Abstract: There has been a growing interest in solving Visual Question Answering (VQA) tasks that require the model to reason beyond the content present in the image. In this work, we focus on questions that require commonsense reasoning. In contrast to previous methods which inject knowledge from static knowledge bases, we investigate the incorporation of contextualized knowledge using Commonsense Transformer (COMET), an existing knowledge model trained on human-curated knowledge bases. We propose a method to generate, select, and encode external commonsense knowledge alongside visual and textual cues in a new pre-trained Vision-Language-Commonsense transformer model, VLC-BERT. Through our evaluation on the knowledge-intensive OK-VQA and A-OKVQA datasets, we show that VLC-BERT is capable of outperforming existing models that utilize static knowledge bases. Furthermore, through a detailed analysis, we explain which questions benefit, and which don't, from contextualized commonsense knowledge from COMET.

Comments:	Accepted at WACV 2023. For code and supplementary material, see this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2210.13626 [cs.CV]
	(or arXiv:2210.13626v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2210.13626

Submission history

From: Aditya Aravind Chinchure
[v1] Mon, 24 Oct 2022 22:01:17 UTC (4,661 KB)
