How to Configure Good In-Context Sequence for Visual Question Answering

Li Li, Jiawei Peng, Huiyi Chen, Chongyang Gao, Xu Yang

Inspired by the success of Large Language Models in dealing with new tasks via In-Context Learning (ICL) in NLP researchers have also developed Large Vision-Language Models (LVLMs) with ICL capabilities. However when implementing ICL using these LVLMs researchers usually resort to the simplest way like random sampling to configure the in-context sequence thus leading to sub-optimal results. To enhance the ICL performance in this study we use Visual Question Answering (VQA) as case study to explore diverse in-context configurations to find the powerful ones. Additionally through observing the changes of the LVLM outputs by altering the in-context sequence we gain insights into the inner properties of LVLMs improving our understanding of them. Specifically to explore in-context configurations we design diverse retrieval methods and employ different strategies to manipulate the retrieved demonstrations. Through exhaustive experiments on three VQA datasets: VQAv2 VizWiz and OK-VQA we uncover three important inner properties of the applied LVLM and demonstrate which strategies can consistently improve the ICL VQA performance. Our code is provided in: https: //github.com/GaryJiajia/OFv2_ICL_VQA.