# Multimodal LLM History, from CLIP (Radford et al., 2021) to GPT-4o: The Birth of 'Seeing' Language Models

> Source: https://sukruyusufkaya.com/en/learn/llm-muhendisligi/multimodal-tarihce-clip-gpt-4o
> Updated: 2026-06-26T02:08:19.571Z
> Category: LLM Engineering
> Module: Module 19: Multimodal Models — Image + Audio + Video

**TLDR:** A historical and conceptual anatomy of multimodal LLMs. The story begins with ViT (Dosovitskiy et al., 2020), the image transformer, and CLIP (Radford et al., 2021), which marked the birth of image-text alignment via contrastive learning. It continues through BLIP (Li et al., 2022), Flamingo (Alayrac et al., 2022), and the open-source breakthrough LLaVA (Liu et al., 2023), then GPT-4V (Sept 2023), the unified omni-modal GPT-4o (May 2024), and the open-source Llama 3.2 Vision (Sept 2024). It covers this five-year journey of fusing language and images, and what multimodality means for Turkish: OCR on Turkish documents and culturally aware visual understanding.