
VLM Architecture Anatomy: Vision Encoder + Projector + LLM Backbone — Detailed Dissection

A VLM's 3 main components: the vision encoder (SigLIP-400M, ViT-G/14, EVA-CLIP), the projector (MLP / Q-former / Resampler / Cross-attention), and the LLM backbone. Covers the token interleave format, image-token allocation, position-encoding harmony, and 2D/M-RoPE patches, with an architecture table for each popular VLM family.

Şükrü Yusuf KAYA
36 min read
Advanced

1. The 3 Components of a VLM#

Image → [Vision Encoder] → patch embeddings (e.g. 256 patches × 1024 dim)
            ↓
        [Projector] → image tokens (e.g. 256 × 4096, the LLM dim)
            ↓
Text  → [LLM Tokenizer] → text tokens (e.g. 64 × 4096)
            ↓
        [Concat: image_tokens + text_tokens]
            ↓
        [LLM Backbone] → output
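This pipeline can be sketched with concrete shapes. A minimal NumPy sketch, where a single random linear layer stands in for the real 2-layer MLP projector and all dimensions are the illustrative values from the diagram:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, vision_dim, llm_dim, num_text = 256, 1024, 4096, 64

# Vision encoder output: one embedding per image patch
patch_embeddings = rng.standard_normal((num_patches, vision_dim))

# Projector (hypothetical single linear layer for illustration):
# maps vision-encoder space (1024) into the LLM's embedding space (4096)
W_proj = rng.standard_normal((vision_dim, llm_dim)) * 0.02
image_tokens = patch_embeddings @ W_proj              # (256, 4096)

# Text tokens from the tokenizer + embedding table
text_tokens = rng.standard_normal((num_text, llm_dim))  # (64, 4096)

# Concatenate along the sequence axis and feed to the LLM backbone
llm_input = np.concatenate([image_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (320, 4096)
```

The key point the shapes make visible: after the projector, image tokens are indistinguishable from text tokens as far as the LLM is concerned; they simply occupy positions in the same sequence.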

Vision Encoder options:#

  • SigLIP-400M (Google) — modern, most common (LLaVA-OneVision, Idefics3)
  • ViT-L/14 (CLIP-style) — older LLaVA-1.5
  • EVA-CLIP — InternVL family
  • DINOv2 — geometry-aware (some niche VLMs)

Projector options:#

  • MLP (simple, most common) — LLaVA-1.5+ default, 2-3 layers
  • Q-former (BLIP-2) — resamples with learned queries (32 queries → 32 image tokens)
  • Resampler (Flamingo, Idefics) — cross-attention with learned tokens
  • Cross-attention adapter (Llama 3.2 Vision) — interleaved cross-attn layers

2. Architecture Table of Popular VLM Families#

| Model | Vision Enc | Projector | LLM Base | Image Tokens | Resolution |
|---|---|---|---|---|---|
| LLaVA-1.5 13B | ViT-L/14 | MLP 2-layer | Vicuna 13B | 576 | 336×336 (fixed) |
| LLaVA-1.6 (LLaVA-NeXT) | ViT-L/14 | MLP 2-layer | Mistral 7B / Llama 3.1 8B | 576-2880 | dynamic 4× |
| LLaVA-OneVision | SigLIP | MLP | Qwen 2.5 7B | 729 | dynamic |
| Llama 3.2 11B Vision | ViT-H/14 | Cross-attn adapter | Llama 3.1 8B | implicit (cross-attn) | up to 1120×1120 |
| Llama 3.2 90B Vision | ViT-H/14 | Cross-attn | Llama 3.1 70B | implicit | 1120×1120 |
| Qwen 2.5-VL 7B | ViT (Qwen native) | MLP + M-RoPE | Qwen 2.5 7B | up to 8K dynamic | resolution-free |
| Pixtral 12B | ViT (Pixtral native) | MLP | Mistral Nemo 12B | up to 4096 | resolution-free |
| InternVL2.5 8B | InternViT-6B | MLP | InternLM2.5 7B | 256 per tile | dynamic |
| Phi-4-Multimodal | SigLIP | MLP + LoRA | Phi-4-mini 3.8B | 1024 | dynamic |
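The "Image Tokens" column for MLP-projector models is simple patch arithmetic: tokens = (H / patch_size) × (W / patch_size). A quick sanity check:

```python
def vit_token_count(height: int, width: int, patch_size: int) -> int:
    """Number of patch tokens a plain ViT produces for a given resolution."""
    return (height // patch_size) * (width // patch_size)

# LLaVA-1.5: ViT-L/14 at 336×336 → 24 × 24 = 576 tokens
print(vit_token_count(336, 336, 14))  # 576

# LLaVA-OneVision: SigLIP at 384×384 with patch 14 → 27 × 27 = 729 tokens
print(vit_token_count(384, 384, 14))  # 729
```

Dynamic-resolution models (LLaVA-NeXT, InternVL) apply the same formula per tile and multiply by the tile count, which is where ranges like 576-2880 come from.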
Decision: for modern VLM fine-tuning, Qwen 2.5-VL 7B is the baseline (dynamic resolution + multilingual + Apache 2.0 license).
✅ Deliverables
  1. Read the model cards of 3-4 VLMs.
  2. Fill in the architecture table yourself (from the HF configs).
  3. Next lesson: 6.2 — LLaVA Family Fine-Tuning.
