Grounding FT: Bounding-Box Token Format + RefCOCO-Style Task

A VLM's "pointing" capability maps a phrase to a box: 'point to the dog' → [0.32, 0.45, 0.58, 0.71]. Boxes are emitted as tokens, either as <bbox>x1,y1,x2,y2</bbox> or as coordinates normalized to 0-1000. This lesson covers the RefCOCO dataset, grounding evaluation with IoU, and Qwen 2.5-VL's native grounding support.

Şükrü Yusuf KAYA

24 min read

6/26/2026

Advanced


1. Bbox Token Format

Modern VLMs use different grounding formats:

Model	Format
Qwen 2.5-VL	`<|object_ref_start|>label<|object_ref_end|><|box_start|>(x1,y1),(x2,y2)<|box_end|>`
LLaVA-NeXT	`[x_min, y_min, x_max, y_max]` (normalized to 0-1)
Florence-2	`<loc_x1><loc_y1><loc_x2><loc_y2>` (each coordinate quantized into 1000 bins)
Llama 3.2 Vision	descriptive text only (no native bbox)
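The formats in the table differ mainly in coordinate convention, so converting between them is mostly arithmetic. The following is a minimal sketch, not any model's official preprocessing: the function names are hypothetical, the inclusive 0-1000 integer grid and the rounding rule are assumptions, and the `[x, y, width, height]` input is the COCO annotation convention that RefCOCO boxes inherit.

```python
def xywh_to_xyxy(box):
    """Convert a COCO-style [x, y, width, height] box to [x1, y1, x2, y2]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def normalize_box(box_px, width, height):
    """Scale a pixel-space [x1, y1, x2, y2] box to floats in the 0-1 range."""
    x1, y1, x2, y2 = box_px
    return [x1 / width, y1 / height, x2 / width, y2 / height]


def quantize_box(box, scale=1000):
    """Map a normalized box to integers on a 0..scale grid (assumed inclusive)."""
    return [min(scale, max(0, round(v * scale))) for v in box]


def dequantize_box(qbox, scale=1000):
    """Map grid integers back to normalized floats."""
    return [q / scale for q in qbox]
```

With these helpers, `quantize_box([0.32, 0.45, 0.58, 0.71])` gives `[320, 450, 580, 710]`. Quantization loses at most half a grid step per coordinate, which is small relative to the IoU thresholds typically used for evaluation.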

Qwen 2.5-VL example:

User: "Show the dog in the image"
Assistant: "I see a dog in the image.
<|object_ref_start|>dog<|object_ref_end|>
<|box_start|>(123,456),(789,901)<|box_end|>"
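A reply in the token format shown above can be turned back into numeric boxes with a regular expression, and the same format can be produced when building fine-tuning targets. This is a rough sketch under stated assumptions: `format_grounding` and `parse_grounding` are hypothetical helper names, and the code only handles integer coordinates in exactly the `(x1,y1),(x2,y2)` layout of the example. Whether those integers are pixels or a 0-1000 grid depends on the model version and is not decided here.

```python
import re

# Matches one label/box pair in the token layout used in the example above.
BOX_RE = re.compile(
    r"<\|object_ref_start\|>(?P<label>.*?)<\|object_ref_end\|>\s*"
    r"<\|box_start\|>\((?P<x1>\d+),(?P<y1>\d+)\),"
    r"\((?P<x2>\d+),(?P<y2>\d+)\)<\|box_end\|>",
    re.DOTALL,
)


def format_grounding(label, box):
    """Build a target string for one labeled box (integer coordinates)."""
    x1, y1, x2, y2 = box
    return (
        f"<|object_ref_start|>{label}<|object_ref_end|>"
        f"<|box_start|>({x1},{y1}),({x2},{y2})<|box_end|>"
    )


def parse_grounding(text):
    """Return a list of (label, (x1, y1, x2, y2)) pairs found in the text."""
    return [
        (m["label"], (int(m["x1"]), int(m["y1"]), int(m["x2"]), int(m["y2"])))
        for m in BOX_RE.finditer(text)
    ]
```

A reply with no well-formed box yields an empty list, which an evaluation loop can count as a miss rather than raising an error.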

✅ Deliverables

1) Test Qwen 2.5-VL's native grounding.
2) Fine-tune on a RefCOCO subset.
3) Measure IoU.
4) Next lesson: 6.10, Video LLM FT.
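For the IoU measurement, a plain-Python sketch is enough to get started. It assumes boxes are `[x1, y1, x2, y2]` with `x2 >= x1` and `y2 >= y1`, and that predictions and targets share one coordinate system. The `accuracy_at_iou` helper is a hypothetical name; it counts a prediction as correct when IoU reaches a threshold, with 0.5 being the threshold commonly reported on RefCOCO-style benchmarks. A `None` prediction stands for a reply with no parseable box.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, targets, threshold=0.5):
    """Fraction of targets whose prediction reaches the IoU threshold."""
    if not targets:
        return 0.0
    hits = sum(
        1 for p, t in zip(preds, targets)
        if p is not None and iou(p, t) >= threshold
    )
    return hits / len(targets)
```

Running the same function on the base model and on the fine-tuned checkpoint gives a before/after comparison on the chosen RefCOCO subset.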
