Custom Stack FT Lab: Repo-Tuned Model on Mid-Size Repo (~50K LoC)
Fine-tuning (FT) on a company-internal codebase: a ~50K LoC Python + TypeScript repo. Topics: file hierarchy preservation, internal symbol awareness, test file pairing, and commit history mining (good vs. bad code). Fine-tuning a 7B model takes roughly 4-6 hours on an RTX 4090.
Şükrü Yusuf KAYA
28 min read
Advanced

```python
# === Custom repo dataset generation ===
import os

from datasets import Dataset


def walk_repo(repo_path, include_exts=(".py", ".ts", ".tsx", ".js")):
    samples = []
    for root, dirs, files in os.walk(repo_path):
        # Prune hidden directories and node_modules in place so os.walk skips them
        dirs[:] = [d for d in dirs if not d.startswith(".") and d != "node_modules"]
        for f in files:
            if f.endswith(include_exts):
                path = os.path.join(root, f)
                # Read as UTF-8 and replace undecodable bytes instead of crashing
                with open(path, encoding="utf-8", errors="replace") as fp:
                    code = fp.read()
                rel = os.path.relpath(path, repo_path)
                # Each sample: relative path header + file content
                samples.append({"text": f"# File: {rel}\n\n{code}"})
    return samples


samples = walk_repo("/path/to/my-repo")
# 50K LoC repo → ~3000-5000 file samples (one sample per matching file, so the count depends on file sizes)

# Save as an HF dataset
ds = Dataset.from_list(samples)
ds.save_to_disk("custom-repo-ds")

# Then: FIM transform + Qwen2.5-Coder 7B FT (as in lesson 8.2)
```
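The FIM (fill-in-the-middle) transform mentioned at the end of the snippet can be sketched roughly as follows. This is a minimal character-level sketch, not the exact transform from lesson 8.2: the helper name `to_fim`, the `fim_rate` parameter, and the random cut strategy are assumptions made for illustration. The three special tokens follow the prefix-suffix-middle layout documented for Qwen2.5-Coder, but it is worth confirming them against the tokenizer you actually load.

```python
import random

# FIM special tokens as documented for Qwen2.5-Coder (verify against your tokenizer)
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def to_fim(text, rng, fim_rate=0.5):
    """Rewrite a sample into prefix-suffix-middle order with probability fim_rate.

    Hypothetical helper: cut points are chosen uniformly at character level.
    """
    if len(text) < 3 or rng.random() >= fim_rate:
        return text  # keep as an ordinary left-to-right sample
    # Two distinct cut points split the text into non-empty prefix / middle / suffix
    lo, hi = sorted(rng.sample(range(1, len(text)), 2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
    # The model sees prefix and suffix first, then learns to generate the middle
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```

Applied to the saved dataset, this could look like `ds.map(lambda ex: {"text": to_fim(ex["text"], rng)})` with `rng = random.Random(42)` for reproducibility. Mixing FIM and plain samples (here 50/50 by default) is a common choice so the model keeps its ordinary completion ability.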
✅ Deliverables
1) Walk your own repo (or an open-source one) and generate the dataset.
2) Fine-tune Qwen2.5-Coder 7B.
3) Next lesson: 8.7, Code Eval (HumanEval + MBPP + SWE-Bench).
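For the first deliverable, the test file pairing named in the lesson summary could be layered on top of the repo walk. The sketch below is one possible approach under stated assumptions: the helper name `pair_tests` is hypothetical, and the naming conventions it checks (`test_x.py`, `x_test.py`, `x.test.ts`, `x.spec.ts`) are common defaults that may not match your repo.

```python
import os


def pair_tests(paths):
    """Map each source file to a test file using common naming conventions.

    Hypothetical helper: matches on file name only, ignoring directories.
    """
    by_name = {}
    for p in paths:
        by_name.setdefault(os.path.basename(p), []).append(p)

    pairs = {}
    for p in paths:
        stem, ext = os.path.splitext(os.path.basename(p))
        # Skip files that are tests themselves
        if stem.startswith("test_") or stem.endswith((".test", ".spec", "_test")):
            continue
        candidates = [
            f"test_{stem}{ext}",   # pytest style
            f"{stem}_test{ext}",   # suffix style
            f"{stem}.test{ext}",   # Jest / Vitest style
            f"{stem}.spec{ext}",   # spec style
        ]
        for name in candidates:
            if name in by_name:
                pairs[p] = by_name[name][0]  # first match wins
                break
    return pairs
```

Each pair can then be emitted as a single sample that concatenates the source file and its test under their `# File:` headers, so the model sees implementation and test together.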