Dive into Vision Language Models
Published:
Since 2021, we’ve seen increased interest in models that combine the vision and language modalities (Joint Vision-Language models). These vision-language models (VLMs) have shown particularly impressive capabilities in very challenging tasks such as image captioning, text-guided image generation and manipulation, and visual question answering.
In this blog post, I will briefly give a high-level overview of what you need to know about Joint Vision-Language models.
Introduction
What does it mean to call a model a “Vision-Language” Model?