Grounding FT: Bounding-Box Token Format + RefCOCO-Style Task
A VLM's 'pointing' capability: 'point to the dog' → [0.32, 0.45, 0.58, 0.71]. This lesson covers the bbox token format (<bbox>x1,y1,x2,y2</bbox> or normalized 0-1000 coordinates), the RefCOCO dataset, grounding evaluation with IoU, and Qwen 2.5-VL's native grounding support.
Şükrü Yusuf KAYA
24 min read
Advanced
1. Bbox Token Format
Modern VLMs use different grounding formats:
| Model | Format |
|---|---|
| Qwen 2.5-VL | `<\|box_start\|>(x1,y1),(x2,y2)<\|box_end\|>` |
| LLaVA-NeXT | [x_min, y_min, x_max, y_max] |
| Florence-2 | <loc_x1><loc_y1><loc_x2><loc_y2> (quantized corner coordinates) |
| Llama 3.2 Vision | descriptive text only (no native bbox) |
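Because the formats in the table differ mainly in coordinate convention, it helps to have small conversion helpers. The following is a minimal sketch, assuming boxes are stored as (x_min, y_min, x_max, y_max) and that the integer grid runs from 0 to 1000; the function names are hypothetical and not taken from any model's tooling.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def pixel_to_unit(box: Box, width: int, height: int) -> Box:
    """Scale a pixel-space box to floats in [0, 1]."""
    x1, y1, x2, y2 = box
    return (x1 / width, y1 / height, x2 / width, y2 / height)


def unit_to_grid(box: Box, bins: int = 1000) -> Tuple[int, ...]:
    """Quantize a [0, 1] box to integer coordinates in 0..bins (assumed grid)."""
    return tuple(min(bins, max(0, round(v * bins))) for v in box)


def grid_to_pixel(box: Tuple[int, ...], width: int, height: int,
                  bins: int = 1000) -> Box:
    """Map integer grid coordinates back to pixel space."""
    x1, y1, x2, y2 = box
    return (x1 * width / bins, y1 * height / bins,
            x2 * width / bins, y2 * height / bins)


if __name__ == "__main__":
    unit = pixel_to_unit((64, 48, 320, 240), width=640, height=480)
    grid = unit_to_grid(unit)
    print(unit, grid, grid_to_pixel(grid, 640, 480))
```

Quantizing to a fixed grid loses sub-bin precision, so a round trip through the grid only recovers the original pixels when they fall on bin boundaries.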
Qwen 2.5-VL example:
User: "Show the dog in the image"
Assistant: "I see a dog in the image. <|object_ref_start|>dog<|object_ref_end|> <|box_start|>(123,456),(789,901)<|box_end|>"
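To evaluate a response like the one above, the box has to be pulled out of the generated text. Here is a minimal sketch of a parser, assuming the model emits exactly the token layout shown in the example; actual output can vary with model version and prompt, and `parse_grounding` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Matches: <|object_ref_start|>label<|object_ref_end|> <|box_start|>(x1,y1),(x2,y2)<|box_end|>
GROUNDING_PATTERN = re.compile(
    r"<\|object_ref_start\|>(?P<label>.*?)<\|object_ref_end\|>\s*"
    r"<\|box_start\|>\(\s*(?P<x1>\d+)\s*,\s*(?P<y1>\d+)\s*\)\s*,\s*"
    r"\(\s*(?P<x2>\d+)\s*,\s*(?P<y2>\d+)\s*\)<\|box_end\|>"
)


def parse_grounding(text: str) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Return (label, (x1, y1, x2, y2)) for every grounded reference in text."""
    results = []
    for match in GROUNDING_PATTERN.finditer(text):
        box = tuple(int(match.group(k)) for k in ("x1", "y1", "x2", "y2"))
        results.append((match.group("label"), box))
    return results


if __name__ == "__main__":
    reply = ("I see a dog in the image. <|object_ref_start|>dog<|object_ref_end|> "
             "<|box_start|>(123,456),(789,901)<|box_end|>")
    print(parse_grounding(reply))
```

Returning an empty list for text without a well-formed box lets the evaluation loop count unparseable generations as misses instead of crashing.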
✅ Deliverables
1) Test Qwen 2.5-VL's native grounding.
2) Fine-tune on a RefCOCO subset.
3) Measure IoU.
4) Next lesson: 6.10, Video LLM FT.
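The IoU measurement step can be sketched as follows. This assumes predicted and ground-truth boxes share one coordinate space and the (x_min, y_min, x_max, y_max) layout, and it uses the common convention of counting a prediction as correct when IoU is at least 0.5. The function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Optional[Box]], gts: Sequence[Box],
                    threshold: float = 0.5) -> float:
    """Fraction of examples whose predicted box reaches the IoU threshold.

    A prediction of None (e.g. unparseable model output) counts as a miss.
    """
    if not gts:
        return 0.0
    hits = sum(p is not None and iou(p, g) >= threshold
               for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10), None]
    gts = [(0, 0, 10, 10), (5, 0, 15, 10), (1, 1, 2, 2)]
    print(accuracy_at_iou(preds, gts))
```

Running the same function on the base model and on the fine-tuned checkpoint, over the same held-out examples, gives a direct before/after comparison.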