
When to Freeze the Vision Tower? — Probing Lab + Downstream Eval

The most debated decision in VLM fine-tuning: freeze the vision encoder or not? Freezing preserves the pretrained vision capability, trains fast, and carries less risk. Unfreezing can add 2-5% quality, but training becomes 3-5x slower and the risk of overfitting grows. This ablation compares 5 freeze strategies on an RTX 4090 with Qwen 2.5-VL 7B.

Şükrü Yusuf KAYA
26 min read
Advanced
1. 5 Freeze Strategies

| Strategy | Trainable | Trade-off |
|---|---|---|
| (a) Full Frozen | projector + LLM LoRA only | fastest, lowest risk |
| (b) Last-layer Unfrozen | + last ViT layer | light fine-tuning of the vision tower |
| (c) Last 6 layers Unfrozen | + last 6 ViT layers | moderate adaptation |
| (d) Full Unfrozen | entire ViT + projector + LLM | most expensive, most aggressive |
| (e) Vision LoRA | LoRA r=8 on the ViT | balanced |
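As a rough illustration of how strategies (a) through (d) could be expressed in code, the sketch below decides per parameter name whether a vision-tower weight should be trainable. The name pattern `visual.blocks.<i>.` and the depth of 32 are assumptions, not taken from this lesson; check them against your own `model.named_parameters()` output. Strategy (e) leaves the base ViT weights frozen and attaches LoRA adapters separately, so it returns `False` here.

```python
import re

# Assumed naming scheme for ViT blocks; verify against model.named_parameters().
VIT_BLOCK = re.compile(r"visual\.blocks\.(\d+)\.")

# How many trailing ViT blocks each partial strategy unfreezes.
LAST_N = {"last_layer": 1, "last_6": 6}


def vit_param_trainable(name: str, strategy: str, depth: int) -> bool:
    """Return True if a vision-tower parameter should have requires_grad=True.

    `depth` is the number of ViT blocks (hypothetical value: 32).
    """
    if strategy == "full_unfrozen":          # (d): every ViT weight trains
        return True
    if strategy in ("full_frozen", "vision_lora"):
        # (a): ViT stays frozen; (e): base weights frozen, adapters added elsewhere
        return False
    match = VIT_BLOCK.search(name)
    if match is None:                        # e.g. patch embedding: stays frozen
        return False
    return int(match.group(1)) >= depth - LAST_N[strategy]


# Usage sketch with a PyTorch model (not executed here):
# for name, p in model.named_parameters():
#     if name.startswith("visual.") and "merger" not in name:
#         p.requires_grad = vit_param_trainable(name, "last_6", depth=32)
```

The projector and the LLM LoRA adapters are deliberately left out of this helper, since every strategy in the table trains them.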

Benchmark (Qwen 2.5-VL 7B + RTX 4090 + 5K TR-VQA):

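| Strategy | DocVQA acc | OCR-TR acc | Wall-clock | Peak GB |
|---|---|---|---|---|
| (a) Full frozen | 78.4% | 82.1% | 4h | 14.2 |
| (b) Last-layer | 79.2% | 83.5% | 5h | 15.8 |
| (c) Last-6 layers | 80.1% | 84.6% | 7h | 18.4 |
| (d) Full unfrozen | 80.3% | 85.1% | 14h | 23.5 (tight) |
| (e) Vision LoRA r=8 | 79.6% | 83.9% | 5h | 15.4 |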
Verdict: (c), last 6 layers unfrozen, is the sweet spot. If the budget is tight, use (a) or (e).
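The trade-off behind this verdict can be made explicit with a little arithmetic on the benchmark rows. The sketch below copies the numbers from the table and reports, for each strategy, the gain over (a), the extra training hours, and the memory headroom. The helper name `vs_baseline` and the dictionary keys are illustrative; the 24 GB figure is the RTX 4090's VRAM.

```python
# Rows copied from the benchmark table: (DocVQA %, OCR-TR %, hours, peak GB)
RESULTS = {
    "a_full_frozen":   (78.4, 82.1, 4, 14.2),
    "b_last_layer":    (79.2, 83.5, 5, 15.8),
    "c_last_6":        (80.1, 84.6, 7, 18.4),
    "d_full_unfrozen": (80.3, 85.1, 14, 23.5),
    "e_vision_lora":   (79.6, 83.9, 5, 15.4),
}
VRAM_GB = 24.0  # RTX 4090 memory


def vs_baseline(key: str, base: str = "a_full_frozen") -> dict:
    """Compare one strategy against the baseline row."""
    doc, ocr, hours, peak = RESULTS[key]
    b_doc, b_ocr, b_hours, _ = RESULTS[base]
    extra_h = hours - b_hours
    return {
        "docvqa_gain": round(doc - b_doc, 1),       # accuracy points
        "ocr_gain": round(ocr - b_ocr, 1),          # accuracy points
        "extra_hours": extra_h,
        # Points of DocVQA accuracy bought per additional training hour
        "docvqa_gain_per_hour": round((doc - b_doc) / extra_h, 2) if extra_h else None,
        "vram_headroom_gb": round(VRAM_GB - peak, 1),
    }


for key in RESULTS:
    print(key, vs_baseline(key))
```

Read this way, moving from (c) to (d) costs 7 more hours for 0.2 DocVQA points and leaves only 0.5 GB of headroom, while (e) recovers most of the gain for one extra hour.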
✅ Deliverable
  1. Run strategies (a) and (c) on the same dataset.
  2. Measure the difference in DocVQA accuracy.
  3. Next lesson: 6.8, Document VLM FT (DocVQA/ChartQA + TR).
