Video LLM FT: LLaVA-NeXT-Video + VideoLLaMA3 + Frame Sampling Strategy

A video LLM is the temporal extension of an image LLM. This part covers LLaVA-NeXT-Video, VideoLLaMA3, and Qwen 2.5-VL's native video support; frame sampling (uniform vs. adaptive); temporal token compression; and long-video Q&A (over 1 hour). Video LLM fine-tuning on an RTX 4090 is practical with short clips (10-30 s).

Şükrü Yusuf KAYA

26-minute read

26.06.2026

Advanced

1. Frame Sampling Strategies

Strategy	Sampling rate or rule	Use case
Uniform	every N seconds (e.g. 1 fps)	short clips
Adaptive	scene change detection	long video
Dense	8-16 fps	action recognition
Sparse	0.5 fps (key frames only)	general Q&A
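The uniform and adaptive rows of the table can be sketched as index selection over a decoded clip. This is a minimal illustration, not the sampler used by any of the models above: `uniform_indices` and `scene_change_indices` are hypothetical helper names, and `frame_scores` is an assumed per-frame scalar (for example mean brightness) standing in for a real scene-change signal such as a histogram difference.

```python
import math
from typing import List, Sequence


def uniform_indices(total_frames: int, native_fps: float, target_fps: float) -> List[int]:
    """Uniform sampling: keep one frame every native_fps / target_fps source frames."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Never step by less than one source frame, even if target_fps > native_fps.
    step = max(native_fps / target_fps, 1.0)
    count = math.ceil(total_frames / step)
    return [min(int(i * step), total_frames - 1) for i in range(count)]


def scene_change_indices(frame_scores: Sequence[float], threshold: float) -> List[int]:
    """Adaptive sampling (toy version): keep frame 0, then every frame whose
    score differs from the last kept frame by more than `threshold`."""
    if not frame_scores:
        return []
    kept = [0]
    for i in range(1, len(frame_scores)):
        if abs(frame_scores[i] - frame_scores[kept[-1]]) > threshold:
            kept.append(i)
    return kept


# 30-second clip recorded at 30 fps, sampled at 1 fps and at 0.5 fps.
one_fps = uniform_indices(900, native_fps=30.0, target_fps=1.0)
half_fps = uniform_indices(900, native_fps=30.0, target_fps=0.5)
print(len(one_fps), len(half_fps))  # 30 15
```

Uniform sampling costs the same number of frames regardless of content, which is why it suits short clips; the adaptive variant spends frames only where the signal changes, which is the property that matters for long video.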

Token cost: each frame costs 256-1296 tokens, depending on resolution. A 30-second clip at 1 fps yields 30 frames; at 256 tokens per frame that is 30 × 256 = 7,680 tokens for the video alone.
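The token arithmetic is easy to check for other clip lengths and resolutions. The helper below is a small sketch; `video_token_cost` is a hypothetical name, and it assumes every sampled frame costs the same number of tokens, with no temporal compression applied.

```python
import math


def video_token_cost(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Tokens consumed by the video alone: sampled frames x tokens per frame."""
    frames = math.floor(duration_s * fps)
    return frames * tokens_per_frame


# Low end and high end of the 256-1296 tokens-per-frame range, 30 s clip at 1 fps.
print(video_token_cost(30, 1.0, 256))   # 7680
print(video_token_cost(30, 1.0, 1296))  # 38880
```

At the high-resolution end the same 30-second clip costs roughly five times as many tokens, so resolution and frame rate have to be budgeted together.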

RTX 4090 constraint: to keep the video context in the 4-8K token range, 8-32 frames is ideal.
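Inverting the token cost gives the frame budget for a given context size. The sketch below uses a hypothetical helper, `max_frames_for_budget`; the 512 tokens-per-frame figure is an assumed intermediate resolution, not a number from the text, chosen to show how a range like 8-32 frames can arise.

```python
def max_frames_for_budget(token_budget: int, tokens_per_frame: int) -> int:
    """Largest frame count whose video tokens fit inside token_budget."""
    if token_budget <= 0 or tokens_per_frame <= 0:
        return 0
    return token_budget // tokens_per_frame


print(max_frames_for_budget(4096, 512))   # 8   (4K budget, assumed 512 tokens/frame)
print(max_frames_for_budget(8192, 256))   # 32  (8K budget, 256 tokens/frame)
print(max_frames_for_budget(8192, 1296))  # 6   (8K budget, highest resolution)
```

At the highest per-frame cost even an 8K budget holds only a handful of frames, so on a single RTX 4090 lowering resolution is usually the first lever before cutting frame count.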

✅ Part VI complete

1) Fine-tune Qwen 2.5-VL or LLaVA-Video-7B on 100 short clips. 2) Next part: Part VII, Speech & Audio (Whisper FT).
