Video LLM FT: LLaVA-NeXT-Video + VideoLLaMA3 + Frame Sampling Strategy
A video LLM is the temporal extension of an image LLM. This lesson covers LLaVA-NeXT-Video, VideoLLaMA3, and Qwen 2.5-VL's native video support; frame sampling (uniform vs. adaptive); temporal token compression; and long-video Q&A (>1h). Video LLM fine-tuning on an RTX 4090 is practical with short clips (10-30s).
Şükrü Yusuf KAYA
26 min read
Advanced
1. Frame Sampling Strategies
| Strategy | Sampling rate / trigger | Use case |
|---|---|---|
| Uniform | every N seconds (e.g. 1 fps) | short clips |
| Adaptive | scene change detection | long video |
| Dense | 8-16 fps | action recognition |
| Sparse | 0.5 fps (key frames only) | general Q&A |
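As a rough illustration of the first two rows of the table, the sketch below picks frame indices uniformly at a target rate and adaptively from per-frame change scores. The function names, the `min_gap` parameter, and the idea of a precomputed `diff_scores` list (for example, mean absolute pixel difference to the previous frame) are assumptions made for this example, not part of any of the models' APIs.

```python
def uniform_frame_indices(total_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Pick frame indices at roughly target_fps from a clip decoded at native_fps."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("total_frames, native_fps and target_fps must be positive")
    # Never sample more densely than the clip itself.
    step = max(native_fps / target_fps, 1.0)
    indices = []
    count = 0
    while True:
        idx = int(count * step)
        if idx >= total_frames:
            break
        indices.append(idx)
        count += 1
    return indices


def adaptive_frame_indices(diff_scores: list[float], threshold: float, min_gap: int = 1) -> list[int]:
    """Keep frame 0 plus each frame whose change score exceeds threshold.

    diff_scores[i] is a hypothetical per-frame change score relative to
    frame i-1; min_gap suppresses bursts of near-duplicate picks.
    """
    if not diff_scores:
        return []
    kept = [0]
    for i in range(1, len(diff_scores)):
        if diff_scores[i] > threshold and i - kept[-1] >= min_gap:
            kept.append(i)
    return kept


# A 30-second clip decoded at 30 fps, sampled at 1 fps.
uniform = uniform_frame_indices(total_frames=900, native_fps=30.0, target_fps=1.0)
print(len(uniform), uniform[:3])

# Toy change scores with two scene cuts.
adaptive = adaptive_frame_indices([0.0, 0.1, 0.9, 0.8, 0.1, 0.7], threshold=0.5, min_gap=2)
print(adaptive)
```

Uniform sampling gives a predictable frame count, which makes the token budget easy to plan; the adaptive variant spends frames only where the content changes, which is why the table pairs it with long video.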
Token cost: each frame costs 256-1296 tokens (resolution-dependent). A 30-second clip at 1 fps is 30 frames × 256 = 7,680 tokens for the video alone.
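The arithmetic above can be checked with a small helper. This is a minimal sketch: `video_token_cost` is a hypothetical name, and it assumes a fixed token count per frame, which ignores any temporal token compression a given model may apply.

```python
def video_token_cost(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Estimate visual tokens for a clip sampled at a fixed rate.

    Assumes every frame costs the same number of tokens (no temporal
    compression or frame merging).
    """
    frames = int(duration_s * fps)
    return frames * tokens_per_frame


low = video_token_cost(duration_s=30, fps=1, tokens_per_frame=256)
high = video_token_cost(duration_s=30, fps=1, tokens_per_frame=1296)
print(low, high)  # 7680 38880
```

At the high-resolution end of the range, the same 30-second clip costs roughly five times as many tokens, so resolution matters as much as frame rate when planning context length.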
RTX 4090 constraint: for a video context in the 4-8K token range, 8-32 frames is ideal.
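To see how a token budget translates into a frame count, the sketch below inverts the cost calculation. The helper names are hypothetical, and reading "4K" and "8K" as 4,096 and 8,192 tokens is an assumption for this example.

```python
def max_frames_for_budget(token_budget: int, tokens_per_frame: int) -> int:
    """Largest number of frames that fits in the video token budget."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    return token_budget // tokens_per_frame


def effective_fps(frame_count: int, duration_s: float) -> float:
    """Sampling rate implied by spreading frame_count frames over the clip."""
    return frame_count / duration_s


# Assumed budgets: 4K = 4096 tokens, 8K = 8192 tokens.
for budget in (4096, 8192):
    for tokens_per_frame in (256, 512):
        frames = max_frames_for_budget(budget, tokens_per_frame)
        print(budget, tokens_per_frame, frames, round(effective_fps(frames, 30.0), 2))
```

Under these assumptions, 256 to 512 tokens per frame yields 8 to 32 frames across the two budgets, which matches the range given above; at 1,296 tokens per frame an 8K budget holds only 6 frames.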
✅ Part VI complete
1) Fine-tune Qwen 2.5-VL or LLaVA-Video-7B on 100 short clips.
2) Next part: Part VII, Speech & Audio (Whisper FT).