Vision-Language Models: From CLIP to GPT-4o — Image Encoder + LLM Fusion
Vision-Language Models (VLM) anatomy: CLIP (Radford 2021) image-text alignment, image patch embedding (ViT), projection layer to LLM, GPT-4V (Sept 2023), GPT-4o (May 2024) unified, Llama-3.2 Vision (Sept 2024) open-source. Architecture: image encoder + projection + LLM. Turkish multimodal practice.
Şükrü Yusuf KAYA
75 min read
Advanced 👁️ Vision-Language — the 'seeing' version of the LLM
The GPT-4V launch (September 2023). A user sends a photo and the model describes it — ChatGPT now 'sees'. May 2024, GPT-4o: text + image + audio unified. September 2024, Llama-3.2 Vision: open-source. These models are built on the transformer — images simply become multimodal tokens. CLIP's (Radford 2021) discovery: an image encoder and a text encoder trained into the same embedding space. The modern VLM recipe: image encoder + projection + LLM. In 75 minutes you will have grasped VLM architectural anatomy, the CLIP foundation, and the details of GPT-4o and Llama-3.2 Vision.
Lesson Map (10 Sections)#
- Pre-VLM era — text-only vs vision models separate
- CLIP (Radford 2021) — image-text alignment
- ViT (Dosovitskiy 2020) — vision transformer
- Image patch embedding — 16x16 patches
- VLM architecture — encoder + projection + LLM
- GPT-4V (September 2023) — OpenAI multimodal
- GPT-4o (May 2024) — unified multimodal
- Llama-3.2 Vision (September 2024) — open-source
- Turkish multimodal — Turkish document OCR and understanding
- Production deployment — vLLM multimodal
2-5. CLIP + ViT + VLM Architecture#
2.1 CLIP (Radford 2021)#
OpenAI: 'Learning Transferable Visual Models from Natural Language Supervision'.
400M image-text pairs (web crawl). Two encoders:
- Image encoder: ViT or ResNet → image embedding
- Text encoder: transformer → text embedding
Contrastive learning: matching pairs close, non-matching far in shared embedding space.
Loss: contrastive over positive vs. negative pairs, pushing Image_emb · Text_emb_match >> Image_emb · Text_emb_random.
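A minimal sketch of this symmetric contrastive (InfoNCE) objective; the function and tensor names here are illustrative, and CLIP actually learns the temperature rather than fixing it:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: L2-normalized [batch, d] outputs of the two encoders
    # Similarity matrix: logits[i][j] = image_i · text_j / temperature
    logits = image_emb @ text_emb.T / temperature
    # The matching pair for image i is text i, i.e. the diagonal of the matrix
    targets = torch.arange(len(image_emb), device=image_emb.device)
    # Symmetric cross-entropy: image-to-text and text-to-image directions
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2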
2.2 CLIP impact#
- Zero-shot image classification
- Foundation for virtually all modern VLMs
- DALL-E, Stable Diffusion guidance
2.3 ViT (Vision Transformer)#
Dosovitskiy 2020: 'An Image is Worth 16x16 Words'.
Image 224×224 → 196 patches of 16×16. Each patch flattened → linear projection → patch embedding (like word embedding).
Sequence: [CLS] + 196 patch tokens. A standard transformer encoder then processes it.
2.4 Image patch embedding math#
Image 224×224×3 channels:
- Patches: 14×14 grid = 196 patches
- Each patch: 16×16×3 = 768 numbers
- Linear projection: 768 → d_model (e.g., 768 for ViT-base)
- Positional embedding (learnable)
Sequence: [CLS], patch_1, patch_2, ..., patch_196. Length 197.
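A minimal sketch of the patch-embedding step, assuming a 224×224 RGB input and 16×16 patches; the class name and the use of a strided convolution (equivalent to "flatten each patch + linear projection") are illustrative choices, not a specific library API:

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3, d_model=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2       # 14 * 14 = 196
        # Strided conv == flatten each 16x16x3 patch (768 numbers) + linear projection
        self.proj = nn.Conv2d(in_channels, d_model, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, d_model))

    def forward(self, x):                                        # x: [B, 3, 224, 224]
        x = self.proj(x)                                         # [B, d_model, 14, 14]
        x = x.flatten(2).transpose(1, 2)                         # [B, 196, d_model]
        cls = self.cls_token.expand(x.size(0), -1, -1)           # [B, 1, d_model]
        x = torch.cat([cls, x], dim=1)                           # [B, 197, d_model]
        return x + self.pos_embed                                # add learnable positions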
2.5 Modern VLM architecture#
Image (224×224×3)
    ↓
[ViT image encoder] → image tokens [N_img, D_vit]
    ↓
[Projection layer] → image tokens [N_img, D_llm]
    ↓
[Concatenate with text] ← Text tokens [N_text, D_llm]
    ↓
[LLM transformer]
    ↓
Output text
2.6 Projection layer#
The projection maps the ViT width D_vit (e.g., 768) to the LLM width D_llm (e.g., 4096).
self.image_projection = nn.Linear(d_vit, d_llm)
A simple linear layer is enough; an MLP adds more capacity.
Crucial: the projection layer is learned during training, so image embeddings are projected into the LLM's 'language' space.
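A minimal sketch of this bridge and of how the projected image tokens are concatenated with the text embeddings before the LLM forward pass; ImageProjector, build_multimodal_input, and the dimensions are illustrative, not any specific model's API:

import torch
import torch.nn as nn

class ImageProjector(nn.Module):
    """Maps ViT output tokens [N_img, d_vit] into the LLM space [N_img, d_llm]."""
    def __init__(self, d_vit=768, d_llm=4096):
        super().__init__()
        # A single nn.Linear works; a 2-layer MLP (LLaVA-style) adds capacity
        self.proj = nn.Sequential(
            nn.Linear(d_vit, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, image_tokens):
        return self.proj(image_tokens)

def build_multimodal_input(image_tokens, text_embeddings, projector):
    # Project image tokens into the LLM space, then prepend them to the text embeddings.
    # The concatenated sequence is what the LLM consumes as its input embeddings.
    projected = projector(image_tokens)                       # [B, N_img, d_llm]
    return torch.cat([projected, text_embeddings], dim=1)     # [B, N_img + N_text, d_llm]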
6-9. GPT-4V, GPT-4o, Llama-3.2 Vision#
6.1 GPT-4V (September 2023)#
OpenAI multimodal. Text + image input → text output.
Use cases: image description, chart analysis, screenshot understanding, visual QA.
The architecture is closed-source. Best guess: a CLIP-style image encoder + GPT-4.
6.2 GPT-4o (May 2024)#
'omni' — text + image + audio unified.
Key innovation: real-time audio (300ms latency).
Text, image, and audio embeddings flow through the same transformer.
Quality: improved on GPT-4V benchmarks.
6.3 Llama-3.2 Vision (September 2024)#
Meta open-source: 11B + 90B Vision variants.
Architecture:
- Base Llama-3.1 text model (8B / 70B)
- Vision encoder: CLIP-derived (ViT-H/14)
- Cross-attention layers added to Llama, not just concatenation (see the conceptual sketch below)
- Pre-trained on 6B image-text pairs
Result: Llama-3.2-11B-Vision is competitive with GPT-4V-class models on several benchmarks.
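The cross-attention idea can be sketched conceptually. This is illustrative only, not Meta's actual implementation; the zero-initialized tanh gate is an assumption borrowed from Flamingo-style designs so that the pretrained text model is unchanged at the start of training:

import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Inserted between Llama blocks so text tokens can attend to image tokens."""
    def __init__(self, d_llm=4096, n_heads=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_llm, n_heads, batch_first=True)
        # Gate starts at zero: tanh(0) = 0, so the layer is initially a no-op
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, image_tokens):
        # Queries come from text hidden states, keys/values from image tokens
        attn_out, _ = self.attn(text_hidden, image_tokens, image_tokens)
        return text_hidden + torch.tanh(self.gate) * attn_out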
6.4 Llama-3.2 Vision usage#
from transformers import MllamaForConditionalGeneration, AutoProcessor
import torch
from PIL import Image

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("turkce_belge.png")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        # In Turkish: "What is this Turkish document about? Summarize it."
        {"type": "text", "text": "Bu Türkçe belge ne hakkında? Özetle."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(outputs[0], skip_special_tokens=True))
6.5 Turkish multimodal use cases#
- Turkish document OCR + understanding (ID cards, invoices, contracts)
- Turkish chart/graph reading
- KVKK-compliant (Turkish data protection law) visual data processing
- Turkish culture image understanding (food, places, art)
6.6 Production deployment#
vLLM Llama-3.2 Vision support (more on this tomorrow):
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-11B-Vision-Instruct \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --port 8000
The API follows the OpenAI-compatible vision format; images can be sent as base64 or as a URL.
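A minimal client-side sketch against that endpoint using the openai Python client; the base URL, image URL, and prompt are placeholders:

from openai import OpenAI

# Point the client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/fatura.png"}},
            {"type": "text", "text": "What is this Turkish document about? Summarize it."},
        ],
    }],
    max_tokens=500,
)
print(response.choices[0].message.content)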
🎉 Module 19 Complete — Multimodal
Vision-Language Models: image encoder (ViT) + projection + LLM. CLIP (Radford 2021) is the foundation; ViT (Dosovitskiy 2020) is the image transformer. GPT-4V (September 2023) → GPT-4o (May 2024) unified → Llama-3.2 Vision (September 2024) open-source. Turkish multimodal: OCR, documents, charts, culture-specific understanding. Production: vLLM serving. Module 19 inventory: 1 lesson, 75 min. Overall curriculum: 20 modules, 91 lessons, ~100 hours — the 100-hour milestone!
Frequently Asked Questions
Which should you use, GPT-4o or Llama-3.2 Vision? GPT-4o (API, paid, best quality); Llama-3.2 Vision (open-source, self-hosted, decent quality). For Turkish document OCR both work; self-hosting is preferred for KVKK compliance.