VLM Architecture Anatomy: Vision Encoder + Projector + LLM Backbone — Detailed Dissection
A VLM's three main components: vision encoder (SigLIP-SO400M, ViT-G/14, EVA-CLIP), projector (MLP / Q-Former / Resampler / cross-attention), and LLM backbone. Token interleave format, image token allocation, position-encoding harmony, 2D/M-RoPE patches, and an architecture table for each popular VLM family.
Şükrü Yusuf KAYA
36 min read
1. The Three Components of a VLM
Image → [Vision Encoder] → patch embeddings (e.g. 256 patches × 1024 dim)
                              ↓
                         [Projector] → image tokens (e.g. 256 × 4096, LLM dim)
                              ↓
Text  → [LLM Tokenizer] → text tokens (e.g. 64 × 4096)
                              ↓
                  [Concat: image_tokens + text_tokens]
                              ↓
                       [LLM Backbone] → output
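A shape-level sketch of this flow in PyTorch. The tensors are dummies and the dimensions are illustrative (LLaVA-1.5-style numbers); the module here is a hypothetical stand-in, not any specific model's API:

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096
num_patches, num_text_tokens = 256, 64

# [Vision Encoder] output: patch embeddings (random stand-in for ViT/SigLIP output)
patch_embeds = torch.randn(1, num_patches, vision_dim)      # (B, 256, 1024)

# [Projector]: 2-layer MLP mapping vision space -> LLM embedding space (LLaVA-1.5 style)
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
image_tokens = projector(patch_embeds)                       # (B, 256, 4096)

# [LLM Tokenizer + embedding]: text token embeddings (random stand-in)
text_tokens = torch.randn(1, num_text_tokens, llm_dim)       # (B, 64, 4096)

# [Concat]: image tokens precede the text tokens and go into the LLM backbone
llm_input = torch.cat([image_tokens, text_tokens], dim=1)    # (B, 320, 4096)
print(llm_input.shape)  # torch.Size([1, 320, 4096])
```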
Vision Encoder options:
- SigLIP-SO400M (Google) — modern, the most common choice (LLaVA-OneVision, Idefics3); see the loading sketch after this list
- ViT-L/14 (OpenAI CLIP-style) — the older LLaVA-1.5
- EVA-CLIP (EVA-02) — InternVL family
- DINOv2 — geometry-aware (some niche VLMs)
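As a concrete example of the encoder stage, a minimal sketch that loads a SigLIP vision tower from the Hub and inspects its patch embeddings. The checkpoint name is one common choice and the shapes are specific to it; assumes a transformers version that ships SigLIP support:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, SiglipVisionModel

ckpt = "google/siglip-so400m-patch14-384"  # example checkpoint, not the only option
processor = AutoImageProcessor.from_pretrained(ckpt)
vision_tower = SiglipVisionModel.from_pretrained(ckpt)

image = Image.new("RGB", (384, 384))       # stand-in image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = vision_tower(pixel_values=pixel_values)

# last_hidden_state holds the patch embeddings the projector will consume,
# e.g. (1, 729, 1152) for this checkpoint: 27×27 patches × hidden size 1152.
print(out.last_hidden_state.shape)
```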
Projector options:
- MLP (simple, most common) — the LLaVA-1.5+ default, 2-3 layers
- Q-Former (BLIP-2) — resamples with learned queries (32 queries → 32 image tokens); see the sketch after this list
- Resampler (Flamingo, Idefics) — cross-attention with learned latent tokens
- Cross-attention adapter (Llama 3.2 Vision) — interleaved cross-attention layers
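The MLP path is just a couple of linear layers (as in the sketch above). The Q-Former / Resampler path instead compresses a variable number of patches into a fixed number of learned query tokens. Below is a minimal, hypothetical single-layer version; real BLIP-2 / Flamingo blocks add self-attention, FFNs, and multiple layers:

```python
import torch
import torch.nn as nn

class LearnedQueryResampler(nn.Module):
    """Toy resampler: a fixed set of learned queries cross-attends to all patches,
    so the LLM always sees `num_queries` image tokens regardless of patch count."""
    def __init__(self, vision_dim=1024, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_embeds):                     # (B, N_patches, vision_dim)
        b = patch_embeds.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)  # (B, num_queries, vision_dim)
        attended, _ = self.cross_attn(q, patch_embeds, patch_embeds)
        return self.out_proj(attended)                   # (B, num_queries, llm_dim)

resampler = LearnedQueryResampler()
image_tokens = resampler(torch.randn(2, 576, 1024))
print(image_tokens.shape)  # torch.Size([2, 32, 4096])
```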
2. Architecture Table of Popular VLM Families
| Model | Vision Enc | Projector | LLM Base | Image Tokens | Resolution |
|---|---|---|---|---|---|
| LLaVA-1.5 13B | ViT-L/14 | MLP 2-layer | Vicuna 13B | 576 | 336×336 (fixed) |
| LLaVA-1.6 (LLaVA-NeXT) | ViT-L/14 | MLP 2-layer | Mistral 7B / Llama 3.1 8B | 576-2880 | dynamic 4× |
| LLaVA-OneVision | SigLIP-SO400M | MLP | Qwen2 7B | 729 | dynamic |
| Llama 3.2 11B Vision | ViT-H/14 | Cross-attn adapter | Llama 3.1 8B | implicit (cross-attn) | up to 1120×1120 |
| Llama 3.2 90B Vision | ViT-H/14 | Cross-attn | Llama 3.1 70B | implicit | 1120×1120 |
| Qwen 2.5-VL 7B | ViT (Qwen native) | MLP + M-RoPE | Qwen 2.5 7B | up to 8K dynamic | resolution-free |
| Pixtral 12B | ViT (Pixtral native) | MLP | Mistral Nemo 12B | up to 4096 | resolution-free |
| InternVL2.5 8B | InternViT-300M | MLP | InternLM2.5 7B | 256 per tile | dynamic |
| Phi-4-Multimodal | SigLIP | MLP + LoRA | Phi-4-mini 3.8B | 1024 | dynamic |
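One way to verify or extend rows of this table (and deliverable 2 below) is to read the fields straight out of each model's Hugging Face config. A sketch, assuming the example checkpoints are accessible and your transformers version knows these model types; field names vary across families, so the probing is deliberately defensive:

```python
from transformers import AutoConfig

for model_id in [
    "llava-hf/llava-1.5-13b-hf",        # example checkpoints; swap in the families
    "Qwen/Qwen2.5-VL-7B-Instruct",      # from the table above as needed
]:
    cfg = AutoConfig.from_pretrained(model_id)
    vision_cfg = getattr(cfg, "vision_config", None)
    text_cfg = getattr(cfg, "text_config", cfg)  # some families keep text fields top-level

    print(model_id)
    if vision_cfg is not None:
        print("  vision hidden_size:", getattr(vision_cfg, "hidden_size", "?"))
        print("  image_size        :", getattr(vision_cfg, "image_size", "dynamic/?"))
        print("  patch_size        :", getattr(vision_cfg, "patch_size", "?"))
    print("  LLM hidden_size   :", getattr(text_cfg, "hidden_size", "?"))
    print("  LLM num_layers    :", getattr(text_cfg, "num_hidden_layers", "?"))
```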
Decision: For modern VLM fine-tuning, Qwen 2.5-VL 7B is the baseline (dynamic resolution + multilingual + Apache 2.0).
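A minimal loading sketch for that baseline, assuming a transformers release with native Qwen2.5-VL support (class and checkpoint names may differ in older versions). The min_pixels / max_pixels arguments bound the dynamic-resolution token budget per image:

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"

# The processor handles dynamic resolution; the pixel budget controls how many
# image tokens each image produces (values here are illustrative defaults).
processor = AutoProcessor.from_pretrained(
    model_id, min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
```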
✅ Deliverables
1) Read the model cards of 3-4 VLMs.
2) Fill in the architecture table yourself (from the HF configs).
3) Next lesson: 6.2 — LLaVA Family Fine-Tuning.