SmolLM3 1.7B: Tiny Tier — Production Model Running on 8GB RAM Devices

SmolLM3 (HuggingFace, Mar 2025) — 1.7B params, hybrid GQA, 256K context (YaRN), 100% open (data, training pipeline, weights). Edge target: 8GB RAM phone / RPi 5 / IoT. Full FT on RTX 4090 in 25 min. Q4_K_M GGUF → 1.0 GB.

Şükrü Yusuf KAYA
26 min read
Intermediate

1. SmolLM3 Architecture & Openness#

  • Layers: 30, hidden: 2048, KV heads: 4
  • Vocab: 49,152 (BPE, multilingual)
  • 11T pre-training tokens (web + math + code)
  • Hybrid GQA: some layers use full attention, others a local sliding window (for efficiency)
Openness: with SmolLM3, HuggingFace opened the ENTIRE training pipeline:
  • Data sources + cleaning recipes
  • Training code (nanotron + nanoseed)
  • Checkpoints (including intermediate ones)
  • Eval scripts
This makes it the model that comes closest to a truly "reproducible LLM". SmolLM3 is ideal for teaching in this cookbook: you can read the code yourself.
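The hybrid GQA choice matters most for memory at long context. A rough sketch of the KV-cache arithmetic, where head_dim = 128 and a bf16 cache are assumptions for illustration, not published figures:

```python
# Rough KV-cache size estimate for a SmolLM3-style config.
# Assumptions (not from the model card): head_dim = 128, bf16 (2-byte) cache.
layers, head_dim, bytes_per = 30, 128, 2

def kv_cache_gb(seq_len: int, kv_heads: int) -> float:
    # 2x for the K and V tensors, per layer, per KV head
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per / 1e9

print(kv_cache_gb(4_096, 4))     # short chat: ~0.25 GB
print(kv_cache_gb(262_144, 4))   # full 256K context: ~16 GB
print(kv_cache_gb(262_144, 16))  # same context with 16 MHA-style KV heads: 4x larger
```

The takeaway: even though the Q4 model file is only 1.0 GB, a full 256K cache would not fit on an 8GB edge device. GQA's 4 KV heads (versus full MHA) is what keeps the cache sane, and the local-window layers reduce it further.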

2. Edge Deployment Scenarios#

| Device | Q4 token/s | RAM |
| --- | --- | --- |
| Raspberry Pi 5 (8GB) | 4-6 | 1.0 GB |
| Pixel 8 Pro | 18-25 | 1.0 GB |
| iPhone 15 Pro | 28-35 | 1.0 GB |
| MacBook M2 Air (8GB) | 50-70 | 1.0 GB |
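The table numbers translate directly into user-facing latency. A quick sanity check, using rough midpoints of the decode speeds above (the 200-token reply length is an arbitrary example):

```python
# Seconds to generate a 200-token reply at each device's Q4 decode speed
# (tok/s values are approximate midpoints of the ranges in the table above)
devices = {
    "Raspberry Pi 5 (8GB)": 5,   # midpoint of 4-6 tok/s
    "Pixel 8 Pro": 21,           # ~midpoint of 18-25
    "iPhone 15 Pro": 31,         # ~midpoint of 28-35
    "MacBook M2 Air (8GB)": 60,  # midpoint of 50-70
}
reply_tokens = 200
for name, tps in devices.items():
    print(f"{name}: {reply_tokens / tps:.0f} s")
```

A 40-second reply on the RPi 5 is fine for a kiosk queue but borderline for interactive chat; the phones land in a usable 6-10 s range.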
Use cases:
  • Offline chatbot (airport kiosk, military terminal)
  • Local NLP on IoT devices (Turkish voice commands for a smart home)
  • Smart wearables (simple Q&A on a watch)
  • Offline fallback for TR translation
```python
# === SmolLM3 1.7B Full FT (RTX 4090) ===
# Full FT fits: 1.7B params x 2 bytes (bf16) = 3.4 GB for weights,
# plus ~3.4 GB for gradients; paged 8-bit AdamW keeps optimizer state small
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-1.7B-Instruct",
    torch_dtype="bfloat16",
    attn_implementation="flash_attention_2",
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-1.7B-Instruct")

dataset = load_dataset("malhajar/alpaca-gpt4-tr", split="train").map(...)

cfg = SFTConfig(
    output_dir="smol-1.7b-tr-fullft",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,  # low LR for full FT
    warmup_ratio=0.05, lr_scheduler_type="cosine",
    weight_decay=0.01,
    bf16=True, optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    max_seq_length=4096, packing=True,
    dataset_text_field="text",
    logging_steps=5, save_steps=200, report_to="wandb",
)

SFTTrainer(model=model, tokenizer=tok, train_dataset=dataset, args=cfg).train()
# 1 epoch in ~25 minutes: making the most of the 4090
```
SmolLM3 1.7B Full FT: RTX 4090, 25 minutes
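After training, the next step is GGUF conversion: llama.cpp's `convert_hf_to_gguf.py` followed by `llama-quantize` with the `Q4_K_M` preset. The 1.0 GB file size quoted above checks out from bits-per-weight arithmetic; the ~4.85 bits/weight average for Q4_K_M is an approximation, not an exact spec:

```python
# Back-of-envelope size check for the Q4_K_M quantized model
params = 1.7e9
bpw = 4.85  # approx. average bits/weight for Q4_K_M (mixed 4/6-bit blocks)
size_gb = params * bpw / 8 / 1e9
print(f"{size_gb:.2f} GB")  # prints "1.03 GB", matching the 1.0 GB GGUF above
```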
✅ Deliverables
  1. Full-fine-tune SmolLM3 1.7B.
  2. Convert it to Q4_K_M.
  3. If you have an RPi 5 or Android device, deploy it there.
  4. Next lesson: 3.9, DeepSeek-R1-Distill (Reasoning Distillation).
