# VLM Architecture Anatomy: Vision Encoder + Projector + LLM Backbone — Detailed Dissection

> Source: https://sukruyusufkaya.com/en/learn/fine-tuning-cookbook/ftc-vlm-architecture-anatomy
> Updated: 2026-05-14T14:42:53.816Z
> Category: Fine-Tuning Cookbook (Model-by-Model)
> Module: Part VI — Vision-Language Multimodal FT
**TLDR:** A VLM has three main components: a vision encoder (SigLIP-400M, ViT-G/14, EVA-CLIP), a projector (MLP, Q-Former, Resampler, or cross-attention), and an LLM backbone. This section covers the token interleave format, image token allocation, keeping position encodings consistent between image and text tokens, and 2D RoPE / M-RoPE for image patches. It closes with an architecture table for each popular VLM family.
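
To make "image token allocation" concrete, here is a minimal sketch of how many tokens one image can occupy in the LLM's sequence. It is an illustration under stated assumptions, not any particular model's implementation: `VisionConfig`, `num_patches`, `num_image_tokens`, and `expand_image_placeholders` are hypothetical names, the input is assumed square, and the `<image>` placeholder string is only an example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VisionConfig:
    image_size: int = 224  # assumed square input resolution, in pixels
    patch_size: int = 14   # the "/14" in ViT-G/14
    # Hypothetical knob: None models an MLP-style projector that keeps one
    # token per patch; an integer models a Q-Former/Resampler-style projector
    # that emits a fixed number of tokens regardless of patch count.
    projector_out_tokens: Optional[int] = None


def num_patches(cfg: VisionConfig) -> int:
    """Patches produced by a ViT-style encoder (CLS token not counted)."""
    if cfg.image_size % cfg.patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = cfg.image_size // cfg.patch_size
    return per_side * per_side


def num_image_tokens(cfg: VisionConfig) -> int:
    """Image tokens that reach the LLM after the projector."""
    if cfg.projector_out_tokens is None:
        return num_patches(cfg)
    return cfg.projector_out_tokens


def expand_image_placeholders(
    tokens: List[str], cfg: VisionConfig, placeholder: str = "<image>"
) -> List[str]:
    """Replace each placeholder with one slot per image token (interleaving)."""
    n = num_image_tokens(cfg)
    out: List[str] = []
    for tok in tokens:
        if tok == placeholder:
            out.extend([placeholder] * n)
        else:
            out.append(tok)
    return out


if __name__ == "__main__":
    prompt = ["Describe", "<image>", "briefly"]

    mlp_cfg = VisionConfig()  # 224 / 14 = 16 patches per side
    print(num_image_tokens(mlp_cfg))                        # 256
    print(len(expand_image_placeholders(prompt, mlp_cfg)))  # 258

    resampler_cfg = VisionConfig(projector_out_tokens=32)
    print(len(expand_image_placeholders(prompt, resampler_cfg)))  # 34
```

The design point this illustrates: with a per-patch projector the image's share of the context grows quadratically with resolution, while a fixed-output projector keeps it constant at the cost of compressing the visual features.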

