Dive into Vision Language Models

Since 2021, we’ve seen increased interest in models that combine the vision and language modalities (joint vision-language models, or VLMs). VLMs have shown particularly impressive capabilities in challenging tasks such as image captioning, text-guided image generation and manipulation, and visual question answering.

In this blog post, I will briefly give a high-level overview of what you need to know about joint vision-language models.

Introduction

What does it mean to call a model a “Vision-Language” Model?