multimodal

Published: April 17, 2023

Visual instruction tuning

By Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

Research TL;DR

"Pioneers multimodal instruction tuning by connecting a CLIP vision encoder to LLaMA. Introduces LLaVA and lays the groundwork for later open-source visual assistants."

Abstract

Instruction tuning of text-only language models has shown great success. In this paper, we present the first attempt to use GPT-4 to generate multimodal instruction-following data. Using this data, we connect a vision encoder to an LLM to build LLaVA, a general-purpose multimodal assistant.
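
To make "connect a vision encoder and LLM" concrete, here is a minimal sketch of one way such a connection can work: a linear projection maps each image-patch feature into the LLM's embedding space, and the projected visual tokens are placed before the text token embeddings. The abstract does not describe the connector, so the linear projection, the function names (project_patches, build_input_sequence), and the toy dimensions are all assumptions for illustration. This is not the authors' code.

```python
import random


def project_patches(patch_features, weight):
    """Map each vision-encoder patch feature into the LLM embedding space.

    patch_features: list of N vectors, each of length d_vision
    weight: d_vision x d_llm matrix, given as a list of rows
    returns: list of N vectors, each of length d_llm
    """
    d_llm = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(d_llm)]
        for f in patch_features
    ]


def build_input_sequence(visual_tokens, text_embeddings):
    """Place the projected visual tokens before the text token embeddings."""
    return visual_tokens + text_embeddings


# Toy sizes (hypothetical; real encoders and LLMs use far larger dimensions).
random.seed(0)
D_VISION, D_LLM, N_PATCHES, N_TEXT = 4, 6, 3, 5

# Stand-ins for a trainable projection, encoder output, and text embeddings.
weight = [[random.gauss(0, 0.02) for _ in range(D_LLM)] for _ in range(D_VISION)]
patches = [[random.random() for _ in range(D_VISION)] for _ in range(N_PATCHES)]
text = [[random.random() for _ in range(D_LLM)] for _ in range(N_TEXT)]

visual_tokens = project_patches(patches, weight)
sequence = build_input_sequence(visual_tokens, text)
print(len(sequence), len(sequence[0]))  # prints "8 6"
```

In this sketch the only new parameters are the entries of the projection matrix, so the vision encoder and the LLM can both be reused as they are, with the projection learning to translate between them.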

Read full paper on arXiv →

Related Research

Jan 2024

MAMMOTH: Massive multimodal helper for multi-discipline reasoning

Read Synopsis →