Qwen 2.5-VL: Dynamic Resolution + M-RoPE + Turkish OCR FT (Invoice/Petition)
Qwen 2.5-VL (3B/7B/72B) — a modern multimodal champion: **dynamic resolution** (no fixed 224×224 input), **M-RoPE** (temporal + height + width axes), document understanding, video, multilingual. End-to-end Turkish invoice/petition OCR fine-tuning: dataset prep, vision tower freezing, LoRA targets, accuracy measurement.
Şükrü Yusuf KAYA
38 min read
1. Qwen 2.5-VL Architectural Features
| Aspect | Detail |
|---|---|
| Vision encoder | Qwen native ViT (672M params) |
| Resolution | Dynamic — accepts any input resolution |
| Image tokens | ~1 token per 28×28 pixel patch (e.g. 1024×1024 → 1296 tokens) |
| Position encoding | M-RoPE (multi-axis) |
| Video support | Yes (frame sequence) |
| Languages | TR/EN/ZH + 30+ others |
| Long context | 128K |
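To make the table's token math concrete, here is a minimal sketch of how resolution maps to token count, under simplified assumptions: each side is floored to a multiple of 28, while the real processor uses a smarter rounding resize, and the default pixel budget below is illustrative, not the library's.

```python
import math

def approx_image_tokens(height: int, width: int,
                        patch: int = 28,
                        max_pixels: int = 16384 * 28 * 28) -> int:
    """Approximate Qwen 2.5-VL image token count: one token per
    28x28 patch, downscaling first if the pixel budget is exceeded."""
    if height * width > max_pixels:
        # Aspect-preserving downscale into the pixel budget
        scale = math.sqrt(max_pixels / (height * width))
        height, width = int(height * scale), int(width * scale)
    return max(1, height // patch) * max(1, width // patch)

print(approx_image_tokens(1024, 1024))                             # 36*36 = 1296
print(approx_image_tokens(1024, 1024, max_pixels=1280 * 28 * 28))  # capped: 35*35 = 1225
```

Capping `max_pixels` is the main lever for trading OCR detail against sequence length, which is exactly what the fine-tuning script below does.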
M-RoPE in detail: classic RoPE encodes position along a single 1D axis; M-RoPE splits it across three:
- Temporal (video frame index)
- Height (image y-coordinate)
- Width (image x-coordinate)
This gives the model stronger spatial reasoning, since every vision token carries its grid coordinates instead of a flattened 1D index (see the sketch below).
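A toy illustration of the position-id layout, a minimal sketch rather than the exact Hugging Face implementation; the function name `mrope_position_ids` is an assumption. Each vision token receives a (temporal, height, width) triple, and each component rotates its own slice of the query/key dimensions.

```python
import torch

def mrope_position_ids(n_frames: int, h_patches: int, w_patches: int) -> torch.Tensor:
    """Build the three M-RoPE axes for a visual input of
    n_frames x h_patches x w_patches tokens; returns shape (3, N)."""
    n = h_patches * w_patches
    t = torch.arange(n_frames).repeat_interleave(n)                             # frame index
    h = torch.arange(h_patches).repeat_interleave(w_patches).repeat(n_frames)   # row
    w = torch.arange(w_patches).repeat(h_patches * n_frames)                    # column
    return torch.stack([t, h, w])

pos = mrope_position_ids(n_frames=1, h_patches=2, w_patches=3)
print(pos[:, 4])  # tensor([0, 1, 1]): frame 0, row 1, column 1
```

For a static image the temporal axis stays constant; for text tokens all three axes carry the same index, so M-RoPE reduces to ordinary 1D RoPE there.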
```python
# === Turkish invoice OCR FT — Qwen 2.5-VL 7B + RTX 4090 ===
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
import torch

# 4-bit NF4 quantization to fit the 7B model on a single 24 GB card
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=bnb,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
# min/max_pixels belong to the processor, which controls dynamic resolution
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    min_pixels=256 * 28 * 28,
    max_pixels=1280 * 28 * 28,  # cap at ~1280 patches per image
)

# Freeze the vision tower; train only the LLM + projector (merger)
for name, param in model.named_parameters():
    if "visual" in name and "merger" not in name:
        param.requires_grad = False

lora = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

# Turkish invoice dataset — expected example format:
# {"image": <PIL>, "extracted_fields": {"vergi_no": "...", "tutar": "...", "tarih": "..."}}
dataset = load_dataset("user/turkish-invoices", split="train")

def format_invoice(example):
    fields = example["extracted_fields"]
    answer = (f"Vergi No: {fields['vergi_no']}\n"
              f"Tutar: {fields['tutar']} TL\n"
              f"Fatura Tarihi: {fields['tarih']}")
    messages = [
        {"role": "user", "content": [
            {"type": "image", "image": example["image"]},
            {"type": "text", "text": "Bu Türkçe faturadan vergi numarası, tutar ve tarih bilgilerini çıkar."},
        ]},
        {"role": "assistant", "content": answer},
    ]
    return processor.apply_chat_template(messages, tokenize=False)

# Training config (~8 hours for 1000 invoices on an RTX 4090)
cfg = SFTConfig(
    output_dir="qwen-2.5-vl-tr-invoice",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    bf16=True,
    optim="paged_adamw_8bit",
    max_seq_length=8192,
    logging_steps=5,
    report_to="wandb",
)

# Wire up the trainer. NOTE: formatting_func covers the text side only;
# a full multimodal run also needs a collator that feeds pixel values.
trainer = SFTTrainer(
    model=model,
    args=cfg,
    train_dataset=dataset,
    formatting_func=format_invoice,
)
trainer.train()

# Benchmark:
# Base Qwen 2.5-VL field extraction accuracy: ~76%
# After FT (1000 invoices): ~94%
```
Turkish invoice OCR — Qwen 2.5-VL 7B FT
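For the field-extraction accuracy quoted above (~76% → ~94%), a minimal scoring sketch, assuming the exact answer format the FT target uses; `extract_fields` and `field_accuracy` are hypothetical helpers, not part of any library:

```python
import re

def extract_fields(text: str) -> dict:
    """Parse the trained 'Vergi No / Tutar / Fatura Tarihi' answer
    format back into a field dict (None for missing fields)."""
    patterns = {
        "vergi_no": r"Vergi No:\s*(\S+)",
        "tutar": r"Tutar:\s*([\d.,]+)",
        "tarih": r"Fatura Tarihi:\s*(\S+)",
    }
    return {key: (m.group(1) if (m := re.search(pat, text)) else None)
            for key, pat in patterns.items()}

def field_accuracy(pred_texts: list[str], gold_fields: list[dict]) -> float:
    """Exact-match accuracy over every (sample, field) pair."""
    hits = total = 0
    for text, gold in zip(pred_texts, gold_fields):
        pred = extract_fields(text)
        for key, value in gold.items():
            total += 1
            hits += int(pred.get(key) == str(value).strip())
    return hits / max(total, 1)
```

Exact match is strict; normalizing amounts (thousands separators, "TL" suffix) and date formats before comparing usually reflects real extraction quality better.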
✅ Deliverables
1) Find an open Turkish invoice/petition dataset (or generate one synthetically; see the sketch below) and fine-tune. 2) Measure field extraction accuracy. 3) Next lesson: 6.5 — Pixtral 12B.
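If no open dataset fits, a synthetic seed can be rendered directly with PIL; a minimal sketch with made-up field values and layout (a real pipeline would vary fonts, layouts, logos, and scan noise):

```python
import random
from PIL import Image, ImageDraw

def synth_invoice(fields: dict, size=(1024, 1448)) -> Image.Image:
    """Render invoice fields as plain text on a white page;
    only a starting point for a synthetic OCR dataset."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    y = 80
    for label, value in fields.items():
        draw.text((60, y), f"{label}: {value}", fill="black")
        y += 48
    return img

fields = {
    "Vergi No": str(random.randint(10**9, 10**10 - 1)),  # 10-digit tax ID
    "Tutar": f"{random.uniform(100, 99999):.2f}",
    "Fatura Tarihi": "12.03.2025",
}
synth_invoice(fields).save("synthetic_invoice_0001.png")
```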