When to Freeze the Vision Tower? — Probing Lab + Downstream Eval

The most debated decision in VLM fine-tuning is whether to freeze the vision encoder. Freezing it preserves the pretrained vision capability, keeps training fast, and lowers risk. Unfreezing it can add roughly 2-5% quality, but training becomes 3-5x slower and the risk of overfitting grows. This lesson runs an ablation comparing five freeze strategies on Qwen 2.5-VL 7B with an RTX 4090.

Şükrü Yusuf KAYA

26 min read

6/25/2026

Advanced


1. Five Freeze Strategies

Strategy	Trainable	Trade-off
(a) Full Frozen	projector + LLM LoRA only	fastest, lowest risk
(b) Last-layer Unfrozen	+ last ViT layer	light fine-tuning of the vision tower
(c) Last 6 layers Unfrozen	+ last 6 ViT layers	moderate adaptation
(d) Full Unfrozen	entire ViT + projector + LLM	most expensive, most aggressive
(e) Vision LoRA	LoRA r=8 on the ViT	balanced
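
Strategies (a) through (d) differ only in which vision-tower parameters receive gradients, so they can be sketched as a single helper that toggles `requires_grad` by parameter name. This is a minimal sketch, not the lesson's training code. The name patterns (`visual.blocks.N.` for ViT blocks, `visual.merger.` for the projector) are assumptions about how the checkpoint names its parameters, so check them against your own `model.named_parameters()`. The helper leaves LLM and LoRA parameters untouched and does not cover (e), which would normally be set up through a LoRA library. A toy parameter list stands in for a real model so the block runs without a GPU.

```python
import re
from types import SimpleNamespace

# Number of trailing ViT blocks to unfreeze for strategies (a)-(d).
# None means "all blocks". Strategy (e) is not covered by this helper.
LAST_N_BLOCKS = {"a": 0, "b": 1, "c": 6, "d": None}


def set_vision_trainable(named_params, strategy, num_blocks,
                         vision_key="visual.",
                         projector_key="visual.merger.",
                         block_pattern=r"visual\.blocks\.(\d+)\."):
    """Set requires_grad on vision-tower parameters; return trainable names.

    Parameters outside the vision tower (LLM weights, LoRA adapters) are
    left untouched. The name patterns are assumptions, not a documented API.
    """
    last_n = LAST_N_BLOCKS[strategy]
    if last_n is None:
        last_n = num_blocks
    first_trainable = num_blocks - last_n
    trainable = []
    for name, param in named_params:
        if vision_key not in name:
            continue  # not part of the vision tower
        if projector_key in name:
            param.requires_grad = True  # projector trains in every strategy
        else:
            match = re.search(block_pattern, name)
            if match:
                param.requires_grad = int(match.group(1)) >= first_trainable
            else:
                # Patch embedding and other non-block weights: only in (d).
                param.requires_grad = strategy == "d"
        if param.requires_grad:
            trainable.append(name)
    return trainable


def toy_params(num_blocks=8):
    """Hypothetical stand-in for model.named_parameters()."""
    names = ["visual.patch_embed.proj.weight",
             "visual.merger.mlp.0.weight",
             "model.layers.0.self_attn.q_proj.weight"]
    names += [f"visual.blocks.{i}.attn.qkv.weight" for i in range(num_blocks)]
    return [(n, SimpleNamespace(requires_grad=True)) for n in names]


for strategy in "abcd":
    names = set_vision_trainable(toy_params(), strategy, num_blocks=8)
    print(strategy, len(names))
```

With a real PyTorch model the call would look like `set_vision_trainable(model.named_parameters(), "c", num_blocks=...)`, where the block count is read from the model's vision config rather than hard-coded.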

Benchmark (Qwen 2.5-VL 7B + RTX 4090 + 5K TR-VQA):

Strategy	DocVQA acc	OCR-TR acc	Wall-clock	Peak memory (GB)
(a) Full frozen	78.4%	82.1%	4h	14.2
(b) Last-layer	79.2%	83.5%	5h	15.8
(c) Last-6 layers	80.1%	84.6%	7h	18.4
(d) Full unfrozen	80.3%	85.1%	14h	23.5 (tight)
(e) Vision LoRA r=8	79.6%	83.9%	5h	15.4
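
The trade-off in the table is easier to read as ratios against the frozen baseline. The sketch below copies the five rows and computes, for each strategy, the DocVQA gain in points, the share of the full-unfrozen gain it captures, the wall-clock slowdown, and the memory headroom. The 24 GB figure is the RTX 4090's memory capacity; the function and key names are illustrative only.

```python
# Rows copied from the benchmark table: (DocVQA %, OCR-TR %, hours, peak GB).
RESULTS = {
    "a": (78.4, 82.1, 4, 14.2),
    "b": (79.2, 83.5, 5, 15.8),
    "c": (80.1, 84.6, 7, 18.4),
    "d": (80.3, 85.1, 14, 23.5),
    "e": (79.6, 83.9, 5, 15.4),
}
GPU_GB = 24.0  # RTX 4090 memory capacity


def tradeoff(key, baseline="a", ceiling="d"):
    """Compare one strategy with the frozen baseline and the full-unfrozen run."""
    doc, _, hours, peak = RESULTS[key]
    base_doc, _, base_hours, _ = RESULTS[baseline]
    top_doc = RESULTS[ceiling][0]
    return {
        "docvqa_gain_pts": round(doc - base_doc, 1),
        "share_of_max_gain": round((doc - base_doc) / (top_doc - base_doc), 2),
        "slowdown": round(hours / base_hours, 2),
        "headroom_gb": round(GPU_GB - peak, 1),
    }


for key in "bcde":
    print(key, tradeoff(key))
```

On these numbers, (c) captures about 89% of the DocVQA gain of (d) at 1.75x the baseline wall-clock, while (d) costs 3.5x and leaves only 0.5 GB of headroom.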

Decision: (c), unfreezing the last 6 layers, is the sweet spot. If the budget is tight, use (a) or (e).

✅ Deliverable

1) Run strategies (a) and (c) on the same dataset. 2) Measure the difference in DocVQA accuracy. 3) Next lesson: 6.8, Document VLM FT (DocVQA/ChartQA + TR).
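
For step 2, one possible way to compare the two runs is to score both sets of predictions against the same references and report the difference in points. The sketch below assumes normalized exact-match scoring, which is an assumption: the lesson does not say how its accuracy is computed. The function names and the toy data are hypothetical and are not results from the lesson.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def accuracy(predictions, references):
    """Exact-match accuracy in percent. Each reference is a list of accepted answers."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    hits = sum(
        normalize(pred) in {normalize(ans) for ans in answers}
        for pred, answers in zip(predictions, references)
    )
    return 100.0 * hits / len(predictions)


# Hypothetical toy data standing in for the two evaluation runs.
references = [["42"], ["Ankara"], ["12 May 2021", "12.05.2021"], ["Bursa"]]
preds_a = ["42", "ankara", "13 May 2021", "Izmir"]   # strategy (a)
preds_c = ["42", "Ankara", "12.05.2021", "Izmir"]    # strategy (c)

acc_a = accuracy(preds_a, references)
acc_c = accuracy(preds_c, references)
print(f"(a) {acc_a:.1f}%  (c) {acc_c:.1f}%  diff {acc_c - acc_a:+.1f} pts")
```

Scoring both runs with the same function and the same reference file keeps the comparison fair; any change to the normalization rule should be applied to both.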


