Overview
LLaVA (Large Language-and-Vision Assistant) is an end-to-end trained large multimodal model that connects a vision encoder (CLIP ViT-L/14) to a large language model (Vicuna) through a learned projection matrix. Training follows a two-stage instruction-tuning procedure: (1) pre-training for feature alignment, which updates only the projection matrix on a filtered subset of CC3M, and (2) end-to-end fine-tuning, which updates both the projection matrix and the LLM. LLaVA is fine-tuned on machine-generated multimodal instruction-following data for visual chat applications and on multimodal reasoning data for the science domain. It demonstrates impressive chat capabilities, at times exhibiting behaviors similar to multimodal GPT-4, and achieves state-of-the-art accuracy on the ScienceQA benchmark. The project releases its models, data, and code as open source for research purposes.
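
To make the architecture and the two-stage procedure concrete, here is a minimal PyTorch sketch. The class name, the `set_stage` helper, and the exact dimensions are illustrative assumptions rather than the reference implementation; the key idea it shows is that a single linear projection maps CLIP patch features into the LLM's token-embedding space, and that the two stages differ only in which parameters are trainable.

```python
import torch
import torch.nn as nn


class LlavaSketch(nn.Module):
    """Hypothetical sketch of the LLaVA design: vision encoder -> projection -> LLM."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. CLIP ViT-L/14 (kept frozen)
        self.llm = llm                        # e.g. Vicuna, consuming input embeddings
        # The trainable projection that aligns visual features with
        # the LLM's word-embedding space.
        self.projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        patch_features = self.vision_encoder(pixel_values)  # (B, N, vision_dim)
        visual_tokens = self.projection(patch_features)     # (B, N, llm_dim)
        # Projected visual tokens are prepended to the text embeddings
        # and the combined sequence is fed to the LLM.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs)


def set_stage(model: LlavaSketch, stage: int) -> None:
    """Freeze/unfreeze parameters to mirror the two training stages (assumed helper)."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False            # vision encoder stays frozen throughout
    for p in model.projection.parameters():
        p.requires_grad = True             # projection trains in both stages
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)     # LLM is updated only in stage 2
```

In stage 1 (`set_stage(model, 1)`), only the projection learns, aligning image features with the frozen LLM; in stage 2 (`set_stage(model, 2)`), the projection and the LLM are tuned jointly on the instruction-following data.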
