Why is LLaVA linear projection still popular when more advanced approaches exist?

Multimodal Architecture Mathematics: Vision Encoder → Projection → LLM — 3 Connection Strategies

The internal architectural mathematics of multimodal LLMs: three strategies for connecting a vision encoder (ViT/CLIP/SigLIP) to an LLM through a projection. (1) Linear projection (LLaVA style, simple), (2) Q-Former (BLIP-2 style, learnable queries), (3) cross-attention (Flamingo/Llama-3.2 style, deep integration). Also covered: image token budget management, the resolution problem, vision-text alignment, a LLaVA-style multimodal architecture in PyTorch from scratch, and image-text alignment for Turkish.

Şükrü Yusuf KAYA

85 min read

5/13/2026

Advanced


🏗️ 3 Ways to Connect Vision to an LLM

The core question of a multimodal LLM: 'How do we feed image information into a transformer language model?' A pre-trained image encoder (CLIP, SigLIP) outputs continuous vectors. The LLM expects a sequence of discrete tokens, each mapped into its own embedding space. How do we bridge this gap?

Five years of research have produced 3 main answers:

(1) Linear Projection (LLaVA style): simple, direct, effective. Map the CLIP output to the LLM embedding dimension with a linear projection and feed it in like ordinary tokens.

(2) Q-Former (BLIP-2 style): an extra module that summarizes the image with learnable 'query' vectors. Fewer image tokens, more efficient.

(3) Cross-Attention (Flamingo / Llama-3.2 style): add extra attention layers inside the LLM. Deep integration; the most powerful but also the most complex.

Each represents a different set of trade-offs. This lesson covers the mathematical anatomy, the PyTorch implementation, and the performance comparison of each. After 85 minutes, you will be able to choose, implement, and debug a multimodal LLM architecture yourself.

What's in This Lesson? (12 Sections)

Overview of the 3 strategies: trade-offs
Strategy 1: Linear Projection, the LLaVA math
Linear Projection in PyTorch
Strategy 2: Q-Former, the BLIP-2 math
The idea of learnable queries
Strategy 3: Cross-Attention, the Flamingo math
How Llama-3.2 Vision uses cross-attention
Empirical comparison of the 3 strategies
Image token budget management: 256, 576, or 2304?
High-resolution images: the patches approach
Vision-text alignment for Turkish
Exercises

2-3. Strategy 1: Linear Projection (LLaVA)

2.1 Math

The simplest approach. 4 steps:

1. Image (336×336×3) → CLIP-ViT-L/14 encoder → vision features ∈ ℝ^{576 × 1024}
   (576 patches × 1024 hidden dim; 336 / 14 = 24 patches per side, 24² = 576)

2. Linear projection: W ∈ ℝ^{1024 × 4096} (4096 = LLM embedding dim)
   image_tokens = vision_features @ W  ∈ ℝ^{576 × 4096}

3. Concatenate with text tokens:
   full_sequence = [<image_start>, image_token_1, ..., image_token_576, <image_end>, text_tokens]
   
4. LLM (Llama, Mistral) standard forward pass

The image behaves like 576 tokens. Everything else is a completely standard LLM.
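The shapes in these steps can be traced with random tensors standing in for the encoder output and the text embeddings. This is a minimal sketch, assuming a 336×336 input with 14×14 patches (which is what yields 576 patches); `fake_vision_features` and `fake_text_embeds` are hypothetical placeholders, not real CLIP or LLM outputs, and the image marker tokens are left out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

image_size, patch_size = 336, 14               # assumed: CLIP-ViT-L/14 at 336 px
num_patches = (image_size // patch_size) ** 2  # 24 * 24 patches
vision_dim, llm_dim = 1024, 4096

# Step 1 (stand-in): encoder output for one image, one row per patch
fake_vision_features = torch.randn(1, num_patches, vision_dim)

# Step 2: a single linear map W from vision space to LLM embedding space
W = nn.Linear(vision_dim, llm_dim, bias=False)
image_tokens = W(fake_vision_features)

# Step 3: concatenate with (stand-in) text embeddings along the sequence axis
fake_text_embeds = torch.randn(1, 12, llm_dim)
full_sequence = torch.cat([image_tokens, fake_text_embeds], dim=1)

print(num_patches, tuple(image_tokens.shape), tuple(full_sequence.shape))
```

From the LLM's point of view, `full_sequence` is just a longer sequence of embeddings; nothing downstream needs to know which rows came from the image.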

2.2 LLaVA-1.5 specifics

An MLP projection (instead of a single linear layer):

self.mm_projector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)

Two linear layers plus a nonlinearity, giving stronger adaptation.
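To get a feel for how small these projectors are, the parameter counts can be checked directly. A minimal sketch, assuming the 1024 → 4096 dimensions used in this lesson; `count_params` is a hypothetical helper.

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    """Total number of parameters (weights and biases) in a module."""
    return sum(p.numel() for p in module.parameters())

linear_projector = nn.Linear(1024, 4096)
mlp_projector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)

linear_params = count_params(linear_projector)
mlp_params = count_params(mlp_projector)
print(f"linear: {linear_params:,}  mlp: {mlp_params:,}")
```

Under these dimensions the single linear layer has about 4.2M parameters and the two-layer MLP about 21M, both tiny next to a multi-billion-parameter LLM.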

2.3 Advantages

Simple: roughly 50 lines of code
Does not disturb the pre-trained LLM: in the first training stage the LLM weights stay fixed and only the projection is trained
Fast training: the projection has only a few million to about 21M parameters
Modular: when a new LLM comes out, re-training the projection is easy

2.4 Disadvantages

Large token budget: 576 image tokens take up a significant part of a short LLM context (about 14% of a 4K window)
'Surface alignment': there is no deep visual-linguistic connection, only a projection
High resolution is hard: a 1024×1024 image yields 4096 patches (with 16×16-pixel patches) → too many tokens

2.5 LLaVA empirical results

LLaVA-1.5-13B (Vicuna LLM):

VQAv2 (Visual Question Answering): 80.0%
GQA (compositional visual reasoning on real-world images): 63.3%
LLaVA-Bench: 78.5% (85% of GPT-4's score)

Behind GPT-4V, but remarkable for a model with 7B-13B parameters.

python

# LLaVA-style multimodal LLM in PyTorch, from scratch
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM, AutoTokenizer

class LLaVAStyleModel(nn.Module):
    def __init__(
        self,
        vision_model_name='openai/clip-vit-large-patch14-336',
        llm_name='meta-llama/Meta-Llama-3-8B-Instruct',  # gated repo: the license must be accepted first
    ):
        super().__init__()

        # 1. Vision encoder (frozen)
        self.vision_encoder = CLIPVisionModel.from_pretrained(vision_model_name)
        for p in self.vision_encoder.parameters():
            p.requires_grad = False

        vision_hidden = self.vision_encoder.config.hidden_size  # 1024

        # 2. LLM (frozen in stage 1, fine-tuned in stage 2)
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name, torch_dtype=torch.bfloat16)
        self.tokenizer = AutoTokenizer.from_pretrained(llm_name)
        llm_hidden = self.llm.config.hidden_size  # 4096

        # 3. Projection (the only trainable part in stage 1)
        self.projector = nn.Sequential(
            nn.Linear(vision_hidden, llm_hidden),
            nn.GELU(),
            nn.Linear(llm_hidden, llm_hidden),
        )

        # Special tokens for image markers. They are not in the base vocabulary,
        # so register them and grow the embedding matrix before looking up their ids.
        self.tokenizer.add_special_tokens(
            {'additional_special_tokens': ['<image_start>', '<image_end>']}
        )
        self.llm.resize_token_embeddings(len(self.tokenizer))
        self.image_start_token = self.tokenizer.convert_tokens_to_ids('<image_start>')
        self.image_end_token = self.tokenizer.convert_tokens_to_ids('<image_end>')
    
    def encode_image(self, image_tensor):
        """image: [B, 3, 336, 336] → image_tokens: [B, 576, 4096]"""
        with torch.no_grad():
            vision_outputs = self.vision_encoder(image_tensor)
            # last_hidden_state: [B, 577, 1024] (577 = 576 patches + 1 CLS)
            # LLaVA itself reads the penultimate layer; the last layer keeps this sketch short.
            vision_features = vision_outputs.last_hidden_state[:, 1:, :]  # drop CLS

        image_tokens = self.projector(vision_features)
        return image_tokens  # [B, 576, 4096]

    def forward(self, image, text_prompt):
        embed = self.llm.get_input_embeddings()

        # 1. Encode image
        image_tokens = self.encode_image(image)  # [B, 576, 4096]
        batch_size = image_tokens.size(0)

        # 2. Tokenize text (a single prompt string, so this path assumes B = 1)
        text_input = self.tokenizer(text_prompt, return_tensors='pt')
        text_input_ids = text_input.input_ids.to(image.device)
        text_embeds = embed(text_input_ids)  # [B, L_text, 4096]

        # 3. Image start/end markers, repeated for every item in the batch
        start_ids = torch.full((batch_size, 1), self.image_start_token, device=image.device)
        end_ids = torch.full((batch_size, 1), self.image_end_token, device=image.device)
        start_embed, end_embed = embed(start_ids), embed(end_ids)

        # 4. Concatenate: [<image_start>, image_tokens, <image_end>, text]
        # The projector runs in float32 while the LLM is bfloat16, so cast before joining.
        full_embeds = torch.cat([
            start_embed,                          # [B, 1, 4096]
            image_tokens.to(text_embeds.dtype),   # [B, 576, 4096]
            end_embed,                            # [B, 1, 4096]
            text_embeds,                          # [B, L_text, 4096]
        ], dim=1)

        # 5. LLM forward pass
        outputs = self.llm(inputs_embeds=full_embeds)
        return outputs
 
# Usage
model = LLaVAStyleModel()
image = torch.randn(1, 3, 336, 336)
response = model(image, 'Bu resimde ne görüyorsun?')  # "What do you see in this picture?"
print(response.logits.shape)  # [1, 578 + L_text, vocab_size]
 
# Training: only the projector is trainable (stage 1: pre-training)
for p in model.parameters():
    p.requires_grad = False
for p in model.projector.parameters():
    p.requires_grad = True
 
# Then stage 2: fine-tune the LLM and projector together (instruction tuning)

LLaVA-Style Multimodal LLM in PyTorch (commented)
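The listing stops at the forward pass, so it is worth sketching how a training loss could be computed on top of it. Image positions have no target token, and a common convention is to give them the label -100, which `F.cross_entropy` ignores by default. The helpers `build_labels` and `causal_lm_loss` below are hypothetical names, and random logits stand in for a real model so the sketch runs without downloading anything.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # default ignore_index of F.cross_entropy

def build_labels(text_input_ids: torch.Tensor, num_prefix_tokens: int) -> torch.Tensor:
    """Prepend IGNORE_INDEX for every image/marker position so only text is scored."""
    batch = text_input_ids.size(0)
    prefix = torch.full((batch, num_prefix_tokens), IGNORE_INDEX, dtype=torch.long)
    return torch.cat([prefix, text_input_ids], dim=1)

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token loss: the logits at position t are scored against the label at t + 1."""
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )

# Toy example: 578 prefix positions (<image_start> + 576 patches + <image_end>)
torch.manual_seed(0)
vocab_size, text_len, prefix_len = 100, 7, 578
text_ids = torch.randint(0, vocab_size, (1, text_len))
labels = build_labels(text_ids, prefix_len)
logits = torch.randn(1, prefix_len + text_len, vocab_size)  # stand-in for model output
loss = causal_lm_loss(logits, labels)
print(tuple(labels.shape), loss.item())
```

In stage 1 this loss would update only the projector; in stage 2 the same loss would flow into the LLM as well. Masking the user prompt in addition to the image positions is a further design choice this sketch does not make.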

4-7. Q-Former + Cross-Attention

4.1 Strategy 2: Q-Former (BLIP-2)

Problem: in LLaVA, 576 image tokens fill up the LLM context. Is it possible to produce fewer but more 'information-dense' tokens?

Solution (Li et al., 2023, BLIP-2): the Q-Former module. Learnable query vectors (typically 32) 'query and summarize' the information in the image features.

Architecture:

Image → CLIP-ViT → vision features (576 × 1024)
        ↓
        Q-Former (32 learnable queries × 768)
          ↓ Self-attention among queries
          ↓ Cross-attention with vision features
          ↓ FFN
        ↓
        32 'visual tokens' (instead of 576)
        ↓ Linear projection
        ↓
        32 tokens for LLM

4.2 Q-Former math

Q = learnable_queries            # [32, 768]
K, V = vision_features           # [576, 1024], projected to 768 inside the attention layer

# Self-attention: the queries exchange information with each other
Q_self = MultiHeadAttention(Q, Q, Q)          # [32, 768]

# Cross-attention: each of the 32 queries 'asks' all 576 patches
Q_cross = MultiHeadAttention(Q_self, K, V)    # [32, 768]

Q_out = FFN(Q_cross)                          # [32, 768]

# Final: project to the LLM dimension
image_tokens = Q_out @ W_proj                 # [32, 4096], W_proj ∈ ℝ^{768 × 4096}

Advantage: 576 → 32 tokens. With a 4K context, the image drops from about 14% of the window to under 1%. Disadvantage: the Q-Former itself is large (188M parameters in BLIP-2), and training is more complex.
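The query mechanism can be sketched as a single PyTorch block. This is a minimal illustration, not the BLIP-2 implementation: the real Q-Former stacks multiple BERT-style layers and adds cross-attention only in some of them, while `MiniQFormerBlock` (a hypothetical name) is one simplified layer with self-attention, cross-attention, and an FFN, followed by the projection to the LLM dimension.

```python
import torch
import torch.nn as nn

class MiniQFormerBlock(nn.Module):
    """One simplified block: self-attn over queries, cross-attn to patches, FFN."""

    def __init__(self, num_queries=32, dim=768, vision_dim=1024, llm_dim=4096, heads=12):
        super().__init__()
        # Learnable queries, shared across all images
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # kdim/vdim let keys and values come from the 1024-d vision features
        self.cross_attn = nn.MultiheadAttention(
            dim, heads, kdim=vision_dim, vdim=vision_dim, batch_first=True
        )
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.to_llm = nn.Linear(dim, llm_dim)

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        q = self.queries.expand(vision_features.size(0), -1, -1)
        q = self.norm1(q + self.self_attn(q, q, q, need_weights=False)[0])
        q = self.norm2(
            q + self.cross_attn(q, vision_features, vision_features, need_weights=False)[0]
        )
        q = self.norm3(q + self.ffn(q))
        return self.to_llm(q)  # one LLM-sized token per query

block = MiniQFormerBlock()
vision_features = torch.randn(2, 576, 1024)  # stand-in for encoder output
image_tokens = block(vision_features)
print(tuple(image_tokens.shape))
```

The output length is fixed by the number of queries, not by the number of patches, which is why the token count stays at 32 even if the encoder resolution changes.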

4.3 BLIP-2 empirical results

BLIP-2 (Flan-T5-XL + Q-Former + ViT):

VQAv2: about 65% zero-shot (clearly below LLaVA-1.5's 80.0%, though the two were not evaluated under the same training setup)
Visual reasoning: comparable
Faster inference (32 tokens vs 576)

In production, BLIP-2 sees only niche use. LLaVA is more popular because of its simplicity.

6.1 Strategy 3: Cross-Attention (Flamingo / Llama-3.2)

The 'deepest' approach. Inject the vision information in between the LLM's layers.

Flamingo architecture:

Insert new 'cross-attention' layers between the LLM's transformer layers.
Each cross-attention layer attends to the image features.

Layer N (LLM): Self-attention + FFN
Layer N.5 (Vision injection): Cross-attention to image features
Layer N+1 (LLM): Self-attention + FFN
Layer N+1.5 (Vision injection): Cross-attention to image features
...

Math (simplified):

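import torch.nn as nn

class FlamingoBlock(nn.Module):
    def __init__(self, dim=4096, vision_dim=1024, heads=32):
        super().__init__()
        self.norm_self, self.norm_cross, self.norm_ffn = (nn.LayerNorm(dim) for _ in range(3))
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # kdim/vdim: keys and values come from the vision feature space
        self.cross_attn = nn.MultiheadAttention(
            dim, heads, kdim=vision_dim, vdim=vision_dim, batch_first=True
        )
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, hidden_states, image_features):
        # LLM self-attention (causal mask omitted for brevity)
        h = self.norm_self(hidden_states)
        hidden_states = hidden_states + self.self_attn(h, h, h, need_weights=False)[0]

        # Vision cross-attention (the new layer): text positions query the image features
        h = self.norm_cross(hidden_states)
        hidden_states = hidden_states + self.cross_attn(
            h, image_features, image_features, need_weights=False
        )[0]

        # FFN
        hidden_states = hidden_states + self.ffn(self.norm_ffn(hidden_states))
        return hidden_states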
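One detail the simplified math leaves out is gating. Flamingo scales each new cross-attention branch by tanh of a learnable scalar initialized to zero, so at the start of training the model behaves exactly like the frozen LLM. The sketch below illustrates only that property; `GatedCrossAttention` is a hypothetical name and the dimensions are deliberately tiny.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Cross-attention whose residual branch is scaled by tanh(gate), with gate = 0 at init."""

    def __init__(self, dim=64, vision_dim=32, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(
            dim, heads, kdim=vision_dim, vdim=vision_dim, batch_first=True
        )
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden_states, image_features):
        attn_out, _ = self.cross_attn(
            self.norm(hidden_states), image_features, image_features, need_weights=False
        )
        return hidden_states + torch.tanh(self.gate) * attn_out

torch.manual_seed(0)
layer = GatedCrossAttention()
text_hidden = torch.randn(2, 10, 64)
image_features = torch.randn(2, 16, 32)

# Gate closed: the layer is an identity map on the text hidden states
out_at_init = layer(text_hidden, image_features)
print(torch.equal(out_at_init, text_hidden))

# Open the gate by hand to see the image features start to influence the output
with torch.no_grad():
    layer.gate.fill_(1.0)
out_open = layer(text_hidden, image_features)
print(torch.equal(out_open, text_hidden))
```

Starting from an identity map means the pre-trained LLM's behavior is preserved on day one, and the vision signal is blended in only as the gate learns to open.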

7.1 Llama-3.2 Vision

Llama-3.2-11B-Vision and 90B-Vision use a Flamingo-like design:

Pre-trained Llama-3.1 (text) + image encoder
A cross-attention layer is inserted at every 4th layer of Llama
The cross-attention layers are trainable; the Llama base is frozen

Advantage: the strongest integration, high quality
Disadvantage: complex implementation, and swapping the base LLM is hard
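The "one cross-attention layer per 4 LLM layers" rule is easy to turn into a layer schedule. The sketch below only does the counting, assuming a 32-layer text backbone and a cross-attention layer placed after every fourth self-attention layer; `build_layer_schedule` is a hypothetical helper, and the exact positions in the released checkpoints may differ.

```python
from typing import List

def build_layer_schedule(num_self_attn_layers: int, every: int) -> List[str]:
    """Layer order with one cross-attention layer after every `every` self-attention layers."""
    schedule = []
    for i in range(1, num_self_attn_layers + 1):
        schedule.append("self")
        if i % every == 0:
            schedule.append("cross")
    return schedule

# Assumption: 32 self-attention layers in the text backbone
schedule = build_layer_schedule(num_self_attn_layers=32, every=4)
num_cross = schedule.count("cross")
print(len(schedule), num_cross)
print(schedule[:10])
```

Under these assumptions the stack grows from 32 to 40 layers, 8 of which are new and trainable while the original 32 stay frozen.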

8.1 Comparison of the 3 strategies

| Property | Linear (LLaVA) | Q-Former (BLIP-2) | Cross-Attn (Flamingo/Llama-3.2) |
|---|---|---|---|
| Implementation complexity | Simple | Medium | Complex |
| Trainable parameters | Low (~4M linear, ~21M MLP) | Medium (~188M) | High (~1B) |
| Image tokens in the LLM sequence | 576 | 32 | 0 (cross-attn) |
| Quality | Good | Medium | Best |
| Inference speed | Slow (576 tokens) | Fast (32 tokens) | Fast (cross-attn) |
| Adapting to a new LLM | Easy | Medium | Hard |

Production choice:

Fast prototype: LLaVA (Linear)
Inference-efficient: BLIP-2 (Q-Former)
Highest quality: Flamingo / Llama-3.2 (Cross-attention)

9.1 Image token budget management

Many current LLMs have a 128K context window. An image takes 576 tokens. Practical impact:

128K context - 576 image tokens ≈ 127K left for text
Usually negligible (~0.5%)

But a high-resolution image (1024×1024) → 4096 patches (at 16×16 pixels per patch) → 4096 tokens. Now it starts to matter:

128K - 4096 ≈ 124K for text. About a 3% loss.
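The arithmetic above generalizes to any resolution and patch size. A minimal sketch with two hypothetical helpers; the 128,000-token window is an assumption (some models define 128K as 131,072), and the CLS token and image markers are ignored.

```python
def image_tokens_for(resolution: int, patch_size: int) -> int:
    """Number of patch tokens for a square image, ignoring any CLS token."""
    per_side = resolution // patch_size
    return per_side * per_side

def context_share(num_tokens: int, context_window: int) -> float:
    """Fraction of the context window consumed by the image."""
    return num_tokens / context_window

CONTEXT = 128_000  # assumed window size

for resolution, patch in [(336, 14), (1024, 16), (1024, 14)]:
    n = image_tokens_for(resolution, patch)
    print(f"{resolution}px / patch {patch}: {n} tokens, {context_share(n, CONTEXT):.2%} of context")
```

Note how sensitive the count is to patch size: the same 1024×1024 image costs 4096 tokens with 16-pixel patches but 5329 with 14-pixel patches.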

10.1 High-resolution strategy

LLaVA-Next (2024): split the image into 4 quadrants, encode each one separately, and add a global low-resolution version. Total: 576 × 4 + 576 = 2880 tokens (still manageable).

Qwen2-VL: dynamic resolution; the number of patches depends on the image size.
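The quadrant scheme described for LLaVA-Next can be sketched with plain tensor operations. This is an illustration of the idea only, assuming an input that is exactly a 2×2 grid of 336-pixel tiles; the actual LLaVA-Next pipeline supports several grid shapes and handles padding, which `tile_with_global_view` (a hypothetical helper) does not.

```python
import torch
import torch.nn.functional as F

def tile_with_global_view(image: torch.Tensor, tile: int = 336) -> torch.Tensor:
    """Split a [3, 2*tile, 2*tile] image into 4 quadrants and add a downscaled global view.

    Returns a tensor of shape [5, 3, tile, tile]: 4 quadrants, then the global view.
    """
    channels, height, width = image.shape
    assert height == 2 * tile and width == 2 * tile, "sketch assumes an exact 2x2 grid"
    # Unfold rows, then columns: [3, 2, 2, tile, tile]
    quadrants = image.unfold(1, tile, tile).unfold(2, tile, tile)
    quadrants = quadrants.permute(1, 2, 0, 3, 4).reshape(4, channels, tile, tile)
    # Low-resolution view of the whole image
    global_view = F.interpolate(
        image.unsqueeze(0), size=(tile, tile), mode="bilinear", align_corners=False
    )
    return torch.cat([quadrants, global_view], dim=0)

image = torch.randn(3, 672, 672)  # stand-in for a preprocessed image
views = tile_with_global_view(image)
tokens_per_view = (336 // 14) ** 2
total_tokens = views.size(0) * tokens_per_view
print(tuple(views.shape), total_tokens)
```

Each of the 5 views would then go through the vision encoder and projector independently, and the resulting token blocks would be concatenated before the text.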

✅ Lesson 19.2 Summary: Multimodal Architecture Mathematics

Three strategies for connecting vision to an LLM: Linear Projection (LLaVA: simple, 576 tokens, the most common), Q-Former (BLIP-2: 32 tokens, more efficient but more complex to implement), and Cross-Attention (Flamingo / Llama-3.2 Vision: the deepest integration, the highest quality). The production choice depends on the use case: fast prototype → Linear, inference efficiency → Q-Former, highest quality → Cross-Attn. Image token budget = 576 (default); 2880 with LLaVA-Next for high resolution. A LLaVA-style model can be implemented from scratch in PyTorch (~100 lines). Next lesson: Turkish multimodal practice, including ID OCR, e-invoices, and document processing.

Next Lesson: Turkish Multimodal Practice

Lesson 19.3 covers production Turkish multimodal use cases: ID card OCR with field extraction, e-invoice and receipt processing, Turkish traffic sign recognition, digitizing Turkish exam papers, and Ottoman Turkish document analysis. Usage patterns for the GPT-4o and Llama-3.2-Vision APIs. A KVKK-compliant pipeline. Production-grade examples.

Frequently Asked Questions

Q: Why is LLaVA-style linear projection still popular when more advanced approaches exist?

A: Because it sits at a **Pareto-optimal point** between simplicity and quality.

**Advantage 1: Simplicity**: About 50 lines of code, easy to debug, simple to understand.

**Advantage 2: Modularity**: To switch LLMs, you only re-train the projection. With cross-attention, everything is trained together, so a swap is hard.

**Advantage 3: Open-source community**: LLaVA has thousands of derivative models (LLaVA-Med, LLaVA-Plus, ShareGPT4V). Q-Former has far fewer.

**Advantage 4: Fast port to new LLMs**: When a new LLM (Llama-4, etc.) arrives, the LLaVA approach can be ported in about a day; cross-attention takes closer to a month.

**Tolerable quality loss**: A 5-10% quality loss is worth the simplicity. In production: research/prototype → LLaVA, frontier quality → cross-attention.
