VLM Architecture Anatomy: Vision Encoder + Projector + LLM Backbone — Detailed Dissection
A VLM's three main components: vision encoder (SigLIP-SO400M, ViT-G/14, EVA-CLIP), projector (MLP / Q-Former / Resampler / cross-attention), and LLM backbone. Token interleave format, image token allocation, position-encoding harmony, 2D/M-RoPE patches, and an architecture table for each popular VLM family.
Şükrü Yusuf KAYA
36 min read
1. The Three Components of a VLM
Image → [Vision Encoder] → patch embeddings (e.g. 256 patches × 1024 dim)
                              ↓
                         [Projector] → image tokens (e.g. 256 × 4096, LLM dim)
                              ↓
Text  → [LLM Tokenizer] → text tokens (e.g. 64 × 4096)
                              ↓
                  [Concat: image_tokens + text_tokens]
                              ↓
                       [LLM Backbone] → output
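A shape-level sketch of this flow in PyTorch. The tensors are dummies and the dimensions are illustrative (LLaVA-1.5-style numbers); the module here is a hypothetical stand-in, not any specific model's API:

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096
num_patches, num_text_tokens = 256, 64

# [Vision Encoder] output: patch embeddings (random stand-in for ViT/SigLIP output)
patch_embeds = torch.randn(1, num_patches, vision_dim)      # (B, 256, 1024)

# [Projector]: 2-layer MLP mapping vision space -> LLM embedding space (LLaVA-1.5 style)
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
image_tokens = projector(patch_embeds)                       # (B, 256, 4096)

# [LLM Tokenizer + embedding]: text token embeddings (random stand-in)
text_tokens = torch.randn(1, num_text_tokens, llm_dim)       # (B, 64, 4096)

# [Concat]: image tokens precede the text tokens and go into the LLM backbone
llm_input = torch.cat([image_tokens, text_tokens], dim=1)    # (B, 320, 4096)
print(llm_input.shape)  # torch.Size([1, 320, 4096])
```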
Vision Encoder options:
- SigLIP-SO400M (Google) — modern, the most common choice (LLaVA-OneVision, Idefics3); see the loading sketch after this list
- ViT-L/14 (OpenAI CLIP-style) — the older LLaVA-1.5
- EVA-CLIP (EVA-02) — InternVL family
- DINOv2 — geometry-aware (some niche VLMs)
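As a concrete example of the encoder stage, a minimal sketch that loads a SigLIP vision tower from the Hub and inspects its patch embeddings. The checkpoint name is one common choice and the shapes are specific to it; assumes a transformers version that ships SigLIP support:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, SiglipVisionModel

ckpt = "google/siglip-so400m-patch14-384"  # example checkpoint, not the only option
processor = AutoImageProcessor.from_pretrained(ckpt)
vision_tower = SiglipVisionModel.from_pretrained(ckpt)

image = Image.new("RGB", (384, 384))       # stand-in image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = vision_tower(pixel_values=pixel_values)

# last_hidden_state holds the patch embeddings the projector will consume,
# e.g. (1, 729, 1152) for this checkpoint: 27×27 patches × hidden size 1152.
print(out.last_hidden_state.shape)
```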
Projector options:
- MLP (simple, most common) — the LLaVA-1.5+ default, 2-3 layers
- Q-Former (BLIP-2) — resamples with learned queries (32 queries → 32 image tokens); see the sketch after this list
- Resampler (Flamingo, Idefics) — cross-attention with learned latent tokens
- Cross-attention adapter (Llama 3.2 Vision) — interleaved cross-attention layers
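The MLP path is just a couple of linear layers (as in the sketch above). The Q-Former / Resampler path instead compresses a variable number of patches into a fixed number of learned query tokens. Below is a minimal, hypothetical single-layer version; real BLIP-2 / Flamingo blocks add self-attention, FFNs, and multiple layers:

```python
import torch
import torch.nn as nn

class LearnedQueryResampler(nn.Module):
    """Toy resampler: a fixed set of learned queries cross-attends to all patches,
    so the LLM always sees `num_queries` image tokens regardless of patch count."""
    def __init__(self, vision_dim=1024, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_embeds):                     # (B, N_patches, vision_dim)
        b = patch_embeds.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)  # (B, num_queries, vision_dim)
        attended, _ = self.cross_attn(q, patch_embeds, patch_embeds)
        return self.out_proj(attended)                   # (B, num_queries, llm_dim)

resampler = LearnedQueryResampler()
image_tokens = resampler(torch.randn(2, 576, 1024))
print(image_tokens.shape)  # torch.Size([2, 32, 4096])
```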
2. Architecture Table of Popular VLM Families
| Model | Vision Enc | Projector | LLM Base | Image Tokens | Resolution |
|---|---|---|---|---|---|
| LLaVA-1.5 13B | ViT-L/14 | MLP 2-layer | Vicuna 13B | 576 | 336×336 (fixed) |
| LLaVA-1.6 (LLaVA-NeXT) | ViT-L/14 | MLP 2-layer | Mistral 7B / Llama 3.1 8B | 576-2880 | dynamic 4× |
| LLaVA-OneVision | SigLIP-SO400M | MLP | Qwen2 7B | 729 | dynamic |
| Llama 3.2 11B Vision | ViT-H/14 | Cross-attn adapter | Llama 3.1 8B | implicit (cross-attn) | up to 1120×1120 |
| Llama 3.2 90B Vision | ViT-H/14 | Cross-attn | Llama 3.1 70B | implicit | 1120×1120 |
| Qwen 2.5-VL 7B | ViT (Qwen native) | MLP + M-RoPE | Qwen 2.5 7B | up to 8K dynamic | resolution-free |
| Pixtral 12B | ViT (Pixtral native) | MLP | Mistral Nemo 12B | up to 4096 | resolution-free |
| InternVL2.5 8B | InternViT-300M | MLP | InternLM2.5 7B | 256 per tile | dynamic |
| Phi-4-Multimodal | SigLIP | MLP + LoRA | Phi-4-mini 3.8B | 1024 | dynamic |
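One way to verify or extend rows of this table (and deliverable 2 below) is to read the fields straight out of each model's Hugging Face config. A sketch, assuming the example checkpoints are accessible and your transformers version knows these model types; field names vary across families, so the probing is deliberately defensive:

```python
from transformers import AutoConfig

for model_id in [
    "llava-hf/llava-1.5-13b-hf",        # example checkpoints; swap in the families
    "Qwen/Qwen2.5-VL-7B-Instruct",      # from the table above as needed
]:
    cfg = AutoConfig.from_pretrained(model_id)
    vision_cfg = getattr(cfg, "vision_config", None)
    text_cfg = getattr(cfg, "text_config", cfg)  # some families keep text fields top-level

    print(model_id)
    if vision_cfg is not None:
        print("  vision hidden_size:", getattr(vision_cfg, "hidden_size", "?"))
        print("  image_size        :", getattr(vision_cfg, "image_size", "dynamic/?"))
        print("  patch_size        :", getattr(vision_cfg, "patch_size", "?"))
    print("  LLM hidden_size   :", getattr(text_cfg, "hidden_size", "?"))
    print("  LLM num_layers    :", getattr(text_cfg, "num_hidden_layers", "?"))
```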
Decision: For modern VLM fine-tuning, Qwen 2.5-VL 7B is the baseline (dynamic resolution + multilingual + Apache 2.0).
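A minimal loading sketch for that baseline, assuming a transformers release with native Qwen2.5-VL support (class and checkpoint names may differ in older versions). The min_pixels / max_pixels arguments bound the dynamic-resolution token budget per image:

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"

# The processor handles dynamic resolution; the pixel budget controls how many
# image tokens each image produces (values here are illustrative defaults).
processor = AutoProcessor.from_pretrained(
    model_id, min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
```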
✅ Deliverables
1) Read the model cards of 3-4 VLMs.
2) Fill in the architecture table yourself (from the HF configs).
3) Next lesson: 6.2 — LLaVA Family Fine-Tuning.