
Qwen 2.5-VL: Dynamic Resolution + M-RoPE + Turkish OCR FT (Invoice/Petition)

Qwen 2.5-VL (3B/7B/72B) — a modern multimodal champion. **Dynamic resolution** (no fixed 224×224), **M-RoPE** (temporal + height + width), document understanding, video, multilingual. End-to-end Turkish invoice/petition OCR fine-tuning: dataset prep, vision-tower freezing, LoRA targeting, and accuracy measurement.

Şükrü Yusuf KAYA
38 min read
Advanced

1. Qwen 2.5-VL Architectural Features

| Aspect | Detail |
| --- | --- |
| Vision encoder | Qwen native ViT (672M params) |
| Resolution | Dynamic — accepts any input resolution |
| Image tokens | 1 token per 28×28 patch (e.g. ~1024×1024, resized to 1008×1008 → 36×36 = 1296 patches) |
| Position encoding | M-RoPE (multi-axis) |
| Video support | Yes (frame sequence) |
| Languages | TR/EN/ZH + 30 more |
| Long context | 128K |
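The patch and token budget implied by the table can be estimated with a small helper. This is a simplification of Qwen's actual `smart_resize` (which also enforces the `min_pixels`/`max_pixels` budget); the 2×2 merge factor reflects Qwen 2.5-VL's patch merger, which fuses four patches into one LLM token:

```python
PATCH = 28  # ViT patch size
MERGE = 2   # 2x2 patches are merged into a single LLM token

def qwen_visual_tokens(h: int, w: int) -> tuple[int, int]:
    """Approximate patch and LLM-token counts: resize each side to the
    nearest multiple of 28, then merge 2x2 patches per token."""
    grid_h = max(round(h / PATCH), 1)
    grid_w = max(round(w / PATCH), 1)
    patches = grid_h * grid_w
    tokens = patches // (MERGE * MERGE)
    return patches, tokens

print(qwen_visual_tokens(1008, 1008))  # (1296, 324)
```

So a ~1-megapixel scan costs the LLM only a few hundred visual tokens, which is why dynamic resolution stays affordable on long documents.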
M-RoPE in detail: classic RoPE is 1D; M-RoPE decomposes position into three axes:
  • Temporal (video frame index)
  • Height (image y-coordinate)
  • Width (image x-coordinate)
This makes spatial reasoning considerably stronger.
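As a toy illustration (not the actual Qwen implementation), the three axes can be materialized as one (temporal, height, width) triple per visual token; for a still image the temporal index stays constant while height/width walk the patch grid:

```python
def mrope_positions(n_frames: int, grid_h: int, grid_w: int) -> list[tuple[int, int, int]]:
    """One (temporal, height, width) position triple per visual token."""
    positions = []
    for t in range(n_frames):          # video frame index
        for y in range(grid_h):        # patch row
            for x in range(grid_w):    # patch column
                positions.append((t, y, x))
    return positions

print(mrope_positions(1, 2, 3))
# [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 0), (0, 1, 1), (0, 1, 2)]
```

Each axis gets its own rotary embedding, so the model can attend by row, by column, or by frame instead of a single flattened index.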
```python
# === Turkish Invoice OCR FT — Qwen 2.5-VL 7B + RTX 4090 ===
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
import torch

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=bnb,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
# min/max_pixels bound the dynamic-resolution token budget; they belong
# to the processor, not the model
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    min_pixels=256 * 28 * 28,
    max_pixels=1280 * 28 * 28,  # ~1280 max patches
)

# Freeze the vision tower; train only the LLM + projector (merger)
for name, param in model.named_parameters():
    if "visual" in name and "merger" not in name:
        param.requires_grad = False

lora = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

# Turkish invoice dataset — example record format
# {"image": <PIL>, "extracted_fields": {"vergi_no": "...", "tutar": "...", ...}}
dataset = load_dataset("user/turkish-invoices", split="train")

def format_invoice(example):
    fields = example["extracted_fields"]
    answer = (f"Vergi No: {fields['vergi_no']}\n"
              f"Tutar: {fields['tutar']} TL\n"
              f"Fatura Tarihi: {fields['tarih']}")
    messages = [
        {"role": "user", "content": [
            {"type": "image", "image": example["image"]},
            # "Extract the tax number, amount, and date from this Turkish invoice."
            {"type": "text", "text": "Bu Türkçe faturadan vergi numarası, tutar ve tarih bilgilerini çıkar."},
        ]},
        {"role": "assistant", "content": answer},
    ]
    return processor.apply_chat_template(messages, tokenize=False)

# Train (~8 h for 1,000 invoices on an RTX 4090)
cfg = SFTConfig(
    output_dir="qwen-2.5-vl-tr-invoice",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    bf16=True, optim="paged_adamw_8bit",
    max_seq_length=8192,
    logging_steps=5, report_to="wandb",
)

trainer = SFTTrainer(
    model=model,
    args=cfg,
    train_dataset=dataset,
    formatting_func=format_invoice,  # text side only; full VLM SFT also
                                     # needs a collator that runs the
                                     # processor on the images
)
trainer.train()

# Bench:
# Base Qwen 2.5-VL field-extraction accuracy: ~76%
# After FT (1,000 invoices): ~94%
```
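One way to produce the field-extraction accuracy quoted in the benchmark is a strict per-field exact match after whitespace normalization. The `gold`/`pred` record structure below is illustrative (it mirrors the `extracted_fields` schema above), not an API of any library:

```python
def field_accuracy(gold: list[dict], pred: list[dict]) -> float:
    """Fraction of fields where prediction exactly matches the label
    after collapsing whitespace."""
    correct = total = 0
    for g, p in zip(gold, pred):
        for key, want in g.items():
            total += 1
            got = p.get(key, "")
            if " ".join(str(got).split()) == " ".join(str(want).split()):
                correct += 1
    return correct / max(total, 1)

gold = [{"vergi_no": "1234567890", "tutar": "1.250,00", "tarih": "2024-05-01"}]
pred = [{"vergi_no": "1234567890", "tutar": "1.250,00", "tarih": "2024-05-02"}]
print(field_accuracy(gold, pred))  # 2 of 3 fields match -> 0.666...
```

Exact match is deliberately harsh: for amounts and dates you may want to normalize formats (e.g. `1.250,00` vs `1250.00`) before comparing, otherwise formatting noise is counted as an OCR error.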
Turkish invoice OCR — Qwen 2.5-VL 7B FT
✅ Deliverables
  1. Find an open Turkish invoice/petition dataset (or generate one synthetically) and fine-tune.
  2. Measure field-extraction accuracy.
  3. Next lesson: 6.5 — Pixtral 12B.
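If no open dataset turns up, the label side of a synthetic set can be generated with the standard library; the field names follow the `{"vergi_no", "tutar", "tarih"}` schema used in the training code. This sketch only produces ground-truth records — the matching images would still need to be rendered (e.g. from an HTML invoice template):

```python
import random

def synthetic_invoice_record(rng: random.Random) -> dict:
    """One synthetic Turkish-invoice label: 10-digit tax number,
    amount in Turkish decimal format, date as DD.MM.YYYY."""
    vergi_no = "".join(str(rng.randint(0, 9)) for _ in range(10))
    tutar = f"{rng.randint(100, 99999)},{rng.randint(0, 99):02d}"
    tarih = f"{rng.randint(1, 28):02d}.{rng.randint(1, 12):02d}.{rng.randint(2020, 2025)}"
    return {"vergi_no": vergi_no, "tutar": tutar, "tarih": tarih}

rng = random.Random(0)  # fixed seed for reproducible labels
records = [synthetic_invoice_record(rng) for _ in range(3)]
```

Pairing each record with a rendered image (and some visual noise: scans, rotations, stamps) gives a controllable starting corpus before mixing in any real invoices.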
