When to Freeze the Vision Tower? — Probing Lab + Downstream Eval
The most debated decision in VLM fine-tuning: should the vision encoder be frozen or not? Freezing preserves the pretrained vision capability, trains faster, and carries less risk. Unfreezing yields roughly 2-5% higher quality, but training is 3-5x slower and the risk of overfitting rises. This ablation compares 5 freeze strategies on an RTX 4090 with Qwen 2.5-VL 7B.
Şükrü Yusuf KAYA
26 min read
Advanced
1. 5 Freeze Strategies
| Strategy | Trainable | Trade-off |
|---|---|---|
| (a) Full Frozen | projector + LLM LoRA only | fastest, lowest risk |
| (b) Last-layer Unfrozen | + last ViT layer | light fine-tuning of the vision tower |
| (c) Last 6 layers Unfrozen | + last 6 ViT layers | moderate adaptation |
| (d) Full Unfrozen | entire ViT + projector + LLM | most expensive, most aggressive |
| (e) Vision LoRA | LoRA r=8 on the ViT | balanced |
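The strategies in the table above can be expressed as a name-based rule over the model's parameters. The following is a minimal sketch, not the author's training code. It assumes parameter names shaped like `visual.blocks.<i>.` for ViT layers and `visual.merger.` for the projector, and that LoRA adapter weights contain `lora_` in their names (as PEFT-style wrappers typically produce); verify these against your own `model.named_parameters()` output. It also assumes the base LLM weights stay frozen in every strategy and that the LLM is adapted only through LoRA. `is_trainable` and `apply_freeze` are hypothetical helper names.

```python
import re

# Assumed parameter-name patterns (hypothetical; inspect model.named_parameters()).
VIT_BLOCK = re.compile(r"(?:^|\.)visual\.blocks\.(\d+)\.")
PROJECTOR = re.compile(r"(?:^|\.)visual\.merger\.")
VISION = re.compile(r"(?:^|\.)visual\.")

# Number of trailing ViT blocks left unfrozen; (d) is handled separately.
LAST_N = {"a": 0, "b": 1, "c": 6, "e": 0}
STRATEGIES = {"a", "b", "c", "d", "e"}


def is_trainable(name, strategy, num_blocks):
    """Return True if parameter `name` should receive gradients under `strategy`."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    in_vision = VISION.search(name) is not None
    if "lora_" in name:
        # LLM adapters always train; vision adapters only in strategy (e).
        return (not in_vision) or strategy == "e"
    if PROJECTOR.search(name):
        return True  # the projector trains in every strategy
    if not in_vision:
        return False  # base LLM weights stay frozen (assumption)
    if strategy == "d":
        return True  # the whole vision tower is unfrozen
    block = VIT_BLOCK.search(name)
    if block is None:
        return False  # e.g. patch embedding stays frozen outside (d)
    return int(block.group(1)) >= num_blocks - LAST_N[strategy]


def apply_freeze(model, strategy, num_blocks):
    """Set requires_grad on every parameter and return the trainable names."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name, strategy, num_blocks)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

With a PyTorch model, `apply_freeze(model, "c", num_blocks)` would be called after the LoRA adapters are attached and before the optimizer is built, with `num_blocks` set to the depth of the vision tower. Printing the returned list is a quick sanity check that only the intended layers are trainable.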
Benchmark (Qwen 2.5-VL 7B + RTX 4090 + 5K TR-VQA):
| Strategy | DocVQA acc | OCR-TR acc | Wall-clock time | Peak memory (GB) |
|---|---|---|---|---|
| (a) Full frozen | 78.4% | 82.1% | 4h | 14.2 |
| (b) Last-layer | 79.2% | 83.5% | 5h | 15.8 |
| (c) Last-6 layers | 80.1% | 84.6% | 7h | 18.4 |
| (d) Full unfrozen | 80.3% | 85.1% | 14h | 23.5 (tight) |
| (e) Vision LoRA r=8 | 79.6% | 83.9% | 5h | 15.4 |
Decision: (c), unfreezing the last 6 layers, is the sweet spot. If the budget is tight, use (a) or (e).
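One way to make the "sweet spot" reading concrete is to ask how much of the accuracy gap between full frozen (a) and full unfrozen (d) each strategy closes, and at what fraction of (d)'s training time. The sketch below does that arithmetic on the benchmark rows copied from the table; the metric itself and the function name `relative_to_extremes` are illustrative choices, not something the benchmark prescribes.

```python
# Rows copied from the benchmark table: (DocVQA %, OCR-TR %, hours, peak GB).
BENCH = {
    "a": (78.4, 82.1, 4, 14.2),
    "b": (79.2, 83.5, 5, 15.8),
    "c": (80.1, 84.6, 7, 18.4),
    "d": (80.3, 85.1, 14, 23.5),
    "e": (79.6, 83.9, 5, 15.4),
}


def relative_to_extremes(bench, floor="a", ceiling="d"):
    """Share of the floor-to-ceiling accuracy gain per strategy, plus time vs ceiling."""
    lo, hi = bench[floor], bench[ceiling]
    out = {}
    for key, (doc, ocr, hours, _gb) in bench.items():
        out[key] = {
            "docvqa_gain_share": (doc - lo[0]) / (hi[0] - lo[0]),
            "ocr_gain_share": (ocr - lo[1]) / (hi[1] - lo[1]),
            "time_vs_ceiling": hours / hi[2],
        }
    return out


if __name__ == "__main__":
    for key, row in relative_to_extremes(BENCH).items():
        print(key, {k: round(v, 2) for k, v in row.items()})
```

On these numbers, (c) recovers about 89% of the DocVQA gain and about 83% of the OCR-TR gain of full unfreezing in half the wall-clock time, while (e) recovers roughly 60% of both gains in about a third of the time.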
✅ Deliverable
1) Run strategies (a) and (c) on the same dataset.
2) Measure the difference in DocVQA accuracy.
3) Next lesson: 6.8, Document VLM FT (DocVQA/ChartQA + TR).