MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding

Authors

  • Revant Gangi Reddy University of Illinois at Urbana-Champaign
  • Xilin Rui Tsinghua University
  • Manling Li University of Illinois at Urbana-Champaign
  • Xudong Lin Columbia University
  • Haoyang Wen University of Illinois at Urbana-Champaign
  • Jaemin Cho University of North Carolina at Chapel Hill
  • Lifu Huang Virginia Tech
  • Mohit Bansal University of North Carolina at Chapel Hill
  • Avirup Sil IBM Research AI
  • Shih-Fu Chang Columbia University
  • Alexander Schwing University of Illinois at Urbana-Champaign
  • Heng Ji University of Illinois at Urbana-Champaign

DOI:

https://doi.org/10.1609/aaai.v36i10.21370

Keywords:

Speech & Natural Language Processing (SNLP)

Abstract

Recently, there has been increasing interest in building question answering (QA) models that reason across multiple modalities, such as text and images. However, QA using images is often limited to picking the answer from a pre-defined set of options. Moreover, images in the real world, especially in news, contain objects that are co-referential to the text, with complementary information coming from both modalities. In this paper, we present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text. Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to, and then predicting a span from the news body text to answer the question. In addition, we introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task. We evaluate both pipeline-based and end-to-end pretraining-based multimedia QA models on our benchmark, and show that they achieve promising performance while still lagging considerably behind human performance, leaving large room for future work on this challenging new task.

Published

2022-06-28

How to Cite

Reddy, R. G., Rui, X., Li, M., Lin, X., Wen, H., Cho, J., Huang, L., Bansal, M., Sil, A., Chang, S.-F., Schwing, A., & Ji, H. (2022). MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10), 11200-11208. https://doi.org/10.1609/aaai.v36i10.21370

Section

AAAI Technical Track on Speech and Natural Language Processing