FIM (Fill-in-the-Middle) Format: Prefix + Suffix → Middle Token Logic
FIM is the backbone of code completion. Classic LLM next-token prediction is insufficient for code: in a real IDE the cursor sits in the middle of a file, so both a prefix and a suffix exist. This lesson covers the FIM training format and dataset preparation by randomly splitting and transforming existing code. The foundation is the Bavarian et al. 2022 paper.
Şükrü Yusuf KAYA
28 min read
Advanced
1. Why Does FIM Matter?
A classic LLM models P(next | prefix). In a real IDE:
def calculate_average(numbers):
    if not numbers:
        return 0
    █  ← cursor here
    return total / len(numbers)
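The cursor is in the middle: a prefix exists, and so does a suffix. Classic next-token prediction cannot see the suffix, so it answers the question "what should be written here?" poorly. In the example, it has no way of knowing that the code below the cursor expects a variable named total.
FIM: the prefix, the suffix, and the middle (the part to be written) are presented together in one sequence, with the middle placed last so that it can be predicted from both sides.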
<fim_prefix>def calculate_average(numbers):
    if not numbers:
        return 0
<fim_suffix>
    return total / len(numbers)
<fim_middle>    total = sum(numbers)
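At inference time only the prefix and suffix are known, so the prompt stops at the middle sentinel and the model is expected to generate the missing part after it. The following is a minimal sketch of that prompt construction, assuming the `<fim_prefix>` / `<fim_suffix>` / `<fim_middle>` token names from the example above. `build_fim_prompt` and the character-offset `cursor` argument are hypothetical names used only for illustration.

```python
# Sentinel strings as written in the example above (an assumption: your
# tokenizer may spell them differently).
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(source: str, cursor: int) -> str:
    """Build a prefix-suffix-middle prompt for the text around a cursor offset."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor must lie inside the source text")
    prefix, suffix = source[:cursor], source[cursor:]
    # The prompt ends with the middle sentinel; the completion is generated after it.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


source = (
    "def calculate_average(numbers):\n"
    "    if not numbers:\n"
    "        return 0\n"
    "    return total / len(numbers)\n"
)
# Place the cursor at the start of the final return line.
cursor = source.index("    return total")
prompt = build_fim_prompt(source, cursor)
print(prompt)
```

Whatever the model generates after the final sentinel is inserted at the cursor position in the editor.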
Training trick: split existing code files at random into three parts. The middle becomes the target; the prefix and suffix become the context.
python
# === FIM dataset generation ===
import random

def to_fim(code, prob=0.5):
    """Randomly split code into prefix/middle/suffix."""
    if random.random() > prob:
        # Keep the sample as plain next-token prediction; prob is the FIM rate
        return code
    # keepends=True preserves newlines, so prefix + middle + suffix == code
    lines = code.splitlines(keepends=True)
    if len(lines) < 5:
        return code
    # Random split points: prefix, middle, and suffix each get at least one line
    start = random.randint(1, len(lines) - 2)
    end = random.randint(start + 1, len(lines) - 1)
    prefix = "".join(lines[:start])
    middle = "".join(lines[start:end])
    suffix = "".join(lines[end:])
    # FIM format in PSM order (prefix, suffix, middle) with StarCoder-style sentinels;
    # Qwen2.5-Coder spells them <|fim_prefix|>, <|fim_suffix|>, <|fim_middle|>
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

# Dataset map (assumes `dataset` is a Hugging Face Dataset with a "code" column)
dataset = dataset.map(lambda x: {"text": to_fim(x["code"])})

# Training: same as with SFTTrainer, packing=True
# The FIM tokens must already be in the tokenizer's special tokens
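A useful sanity check on any FIM transform is that a sample can be turned back into the original file: if prefix + middle + suffix does not reproduce the source exactly, the model is being trained on corrupted boundaries. The sketch below illustrates such a round-trip check. `make_fim` (with explicit split points so the result is deterministic) and `from_fim` are hypothetical helpers written for this illustration, and the sentinel names are assumed to match the ones used above.

```python
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def make_fim(code: str, start: int, end: int) -> str:
    """Split `code` at line indices [start, end) and emit a PSM-ordered sample."""
    # keepends=True keeps each "\n", so the three pieces re-join exactly.
    lines = code.splitlines(keepends=True)
    prefix = "".join(lines[:start])
    middle = "".join(lines[start:end])
    suffix = "".join(lines[end:])
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def from_fim(sample: str) -> str:
    """Rebuild the original text from a PSM-ordered FIM sample."""
    if not sample.startswith(FIM_PREFIX):
        raise ValueError("not a FIM sample")
    body = sample[len(FIM_PREFIX):]
    prefix, rest = body.split(FIM_SUFFIX, 1)
    suffix, middle = rest.split(FIM_MIDDLE, 1)
    return prefix + middle + suffix


code = "a = 1\nb = 2\nc = a + b\nprint(c)\nprint(a)\n"
sample = make_fim(code, start=2, end=3)
restored = from_fim(sample)
print(restored == code)
```

This check assumes the source code never contains the sentinel strings themselves; files that do would need to be filtered out or escaped before the transform.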
✅ Deliverables
1) Take a permissively licensed Python dataset from GitHub and convert it to FIM format.
2) Run a mini fine-tune of Qwen2.5-Coder 1.5B.
3) Next lesson: 8.2: Qwen2.5-Coder Recipes.