
Llama 3.2 1B / 3B — Edge & Mobile FT: Tied Embeddings + Distillation + GGUF Q4

Llama 3.2 1B/3B — distilled from Llama 3.1 8B. Tied embeddings, edge inference. Full FT is possible on an RTX 4090 (bf16 weights: 1B = 2 GB, 3B = 6 GB). 8-15 tok/s on iPhone/Pixel with GGUF Q4_K_M. TR-MMLU numbers and dataset strategies.

Şükrü Yusuf KAYA
34 min read
Advanced

1. How 1B/3B Differ from 8B

| Feature | 1B | 3B | 8B |
|---|---|---|---|
| Layers | 16 | 28 | 32 |
| Hidden size | 2048 | 3072 | 4096 |
| Tied embeddings | ✅ (input = output) | ✅ (input = output) | ❌ |
| Pre-train | distilled from 8B | distilled from 8B | from scratch |
| Active params | 1.24B | 3.21B | 8.03B |
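
The table can be double-checked straight from the model configs. A minimal sketch; it downloads only each `config.json` and assumes access to the gated meta-llama repos:

```python
from transformers import AutoConfig

for name in [
    "meta-llama/Llama-3.2-1B-Instruct",
    "meta-llama/Llama-3.2-3B-Instruct",
    "meta-llama/Llama-3.1-8B-Instruct",
]:
    cfg = AutoConfig.from_pretrained(name)
    print(
        f"{name}: layers={cfg.num_hidden_layers}, "
        f"hidden={cfg.hidden_size}, tied={cfg.tie_word_embeddings}"
    )
# Expected: layers=16/28/32, hidden=2048/3072/4096, tied=True/True/False
```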
Tied embeddings means the input embedding matrix is shared with the `lm_head` weight. Memory savings: the ~262M-parameter vocab projection (128,256 × 2,048 on the 1B) is stored once instead of twice, roughly 0.5 GB in bf16 (~256 MB at 8-bit). But gradient computation differs: `embed_tokens` and `lm_head` reference the same tensor. Consequence: in PEFT, be careful when modifying `embed_tokens` or `lm_head`; if one changes, the other changes automatically.
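
A minimal sketch to see the tie in practice (assumes `transformers` with access to the gated meta-llama checkpoint; the printed sizes match the 1B dimensions above):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.bfloat16
)

emb = model.model.embed_tokens.weight   # input embedding matrix
head = model.lm_head.weight             # output projection

print(emb is head)          # True: both names point at the same tensor
print(tuple(emb.shape))     # (128256, 2048) on the 1B

# The tie means the vocab projection is stored once, not twice:
saved = emb.numel() * emb.element_size()
print(f"~{saved / 2**20:.0f} MiB not duplicated in bf16")   # ~501 MiB
```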

2. Full FT of the 3B on an RTX 4090 — Possible!

3B bf16 weights = 6 GB. The classic budget:

| Term | Value |
|---|---|
| W (bf16 weights) | 6.0 GB |
| G (bf16 gradients) | 6.0 GB |
| O (8-bit AdamW state) | 3.0 GB |
| A (activations; grad-ckpt, seq=4096, batch=2) | 4.5 GB |
| B (buffers) | 3.0 GB |
| **Total** | **22.5 GB — it fits!** |

Advantage: higher quality than QLoRA (no quantization lossiness). Throughput is somewhat slower (no Unsloth fused kernels for full FT), but ~80 minutes per epoch is comfortable on a 4090.
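
The same budget as explicit arithmetic. A sketch: the bytes-per-parameter factors mirror the table's assumptions, and A/B are its empirical constants, not derived values:

```python
P = 3.21e9        # parameters (3B row from the table above)
GB = 1024**3      # binary gigabytes, matching the table

W = P * 2 / GB    # bf16 weights: 2 bytes/param                    -> ~6.0 GB
G = P * 2 / GB    # bf16 gradients: 2 bytes/param                  -> ~6.0 GB
O = P * 1 / GB    # 8-bit AdamW state, per the table (~1 byte/param) -> ~3.0 GB
A = 4.5           # activations (grad-ckpt, seq=4096, batch=2), empirical
B = 3.0           # buffers / CUDA context / fragmentation, empirical

print(f"Total ≈ {W + G + O + A + B:.1f} GB")   # ≈ 22.5 GB -> fits in 24 GB
```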
```python
# === 3.2-1B FastTuning Lab (1 epoch in ~15 minutes) ===
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

model, tok = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    max_seq_length=4096,
    dtype=None,               # auto-detect: bf16 on Ampere and newer
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=64,                     # a high r on 1B -> more quality (capacity)
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# A small dataset is enough for 1B. This assumes the dataset exposes a
# "text" column; otherwise map instruction/output pairs into one first.
dataset = load_dataset("malhajar/alpaca-gpt4-tr", split="train[:20000]")

cfg = SFTConfig(
    output_dir="llama-3.2-1b-tr",
    num_train_epochs=2,
    per_device_train_batch_size=8,   # 1B is small, the batch can be large
    gradient_accumulation_steps=2,
    learning_rate=3e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    optim="paged_adamw_8bit",
    max_seq_length=2048,
    packing=True,
    dataset_text_field="text",
    logging_steps=10,
    save_steps=200,
    report_to="wandb",
)

trainer = SFTTrainer(model=model, tokenizer=tok, train_dataset=dataset, args=cfg)
trainer.train()
# 20K examples × 2 epochs / (8 batch × 2 accum) = 2500 steps; at 5 steps/s ≈ 8 minutes 🚀
```
Llama 3.2 1B FastTuning — 1 epoch in ~15 minutes
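
A quick generation sanity check right after training. A sketch: it assumes the cell above has just run, so `model` and `tok` are still in scope; `FastLanguageModel.for_inference` switches Unsloth into its fast decoding path:

```python
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)   # enable Unsloth's fast decode path

messages = [{"role": "user", "content": "Türkiye'nin başkenti neresidir?"}]
input_ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(input_ids=input_ids, max_new_tokens=64, temperature=0.7)
print(tok.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```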

3. GGUF Q4_K_M Export — iPhone/Pixel Deploy

```bash
# 1. Merge the adapter into the base model
python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.2-1B-Instruct', torch_dtype='bfloat16')
m = PeftModel.from_pretrained(base, 'llama-3.2-1b-tr')
m = m.merge_and_unload()
m.save_pretrained('llama-3.2-1b-tr-merged')
# GGUF conversion needs the tokenizer files next to the weights
AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-1B-Instruct').save_pretrained('llama-3.2-1b-tr-merged')
"

# 2. Convert to GGUF (llama.cpp)
cd llama.cpp
python convert_hf_to_gguf.py ../llama-3.2-1b-tr-merged --outfile llama-1b-tr.fp16.gguf

# 3. Quantize to Q4_K_M
./llama-quantize llama-1b-tr.fp16.gguf llama-1b-tr.Q4_K_M.gguf Q4_K_M
# Output size: ~770 MB (1B × ~0.77 bytes/param for Q4_K_M)
```
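
Before shipping to a phone, the quantized file can be smoke-tested on a laptop. A sketch using the third-party `llama-cpp-python` bindings (`pip install llama-cpp-python`):

```python
from llama_cpp import Llama

llm = Llama(model_path="llama-1b-tr.Q4_K_M.gguf", n_ctx=2048)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "İstanbul hakkında iki cümle yaz."}],
    max_tokens=128,
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
```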
Mobile inference (llama.cpp Android / MLX iOS):

| Device | Tokens/s (Q4_K_M) | RAM usage |
|---|---|---|
| iPhone 15 Pro (A17 Pro) | 14-18 | 1.0 GB |
| Pixel 8 Pro (Tensor G3) | 8-12 | 1.0 GB |
| M2 MacBook | 25-35 | 1.0 GB |
| RTX 4090 (overkill) | 200+ | 1.5 GB |
✅ Deliverables
  1. Fine-tune the 1B or 3B.
  2. Convert it to GGUF Q4_K_M.
  3. If you have an iOS/Android device, load it with the llama.cpp Android app or Pocket-LLM.
  4. Next lesson: 3.3 — Qwen 2.5 / Qwen3 — the TR Champion.
