llama.cpp + Ollama: GGUF Serving + Modelfile + System Prompt Versioning

llama.cpp and Ollama are the gold standard for CPU, Apple Silicon, and edge inference. This lesson covers the GGUF format, Ollama's Modelfile (system prompt + tools versioning), the Ollama API, and the OpenAI-compatible endpoint. Reference point: Q4_K_M Llama 8B in Ollama on an RTX 4090 generates 95 tok/s.

Şükrü Yusuf KAYA

24 min read

6/24/2026

Intermediate


bash

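# === Ollama Modelfile: Turkish custom assistant ===
# File: TurkceAsistan.Modelfile

# Base model. Comments go on their own lines; a trailing "#" after an
# instruction may be read as part of its argument.
FROM llama3.1:8b-instruct-q4_K_M

# Sampling and context settings
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.05

# System prompt, kept in Turkish on purpose. In English: "You are a Turkish AI
# assistant named 'Yıldız'. Answer in natural Turkish. When uncertain, prefer
# saying 'I don't know'. Take the user's age and expertise level into account.
# Explain complex topics with examples."
SYSTEM """Sen 'Yıldız' adında bir Türk AI asistanısın.
- Cevapların doğal Türkçe olsun.
- Belirsizlik durumunda 'bilmiyorum' demeyi tercih et.
- Karşındaki kişinin yaşı ve uzmanlık seviyesini göz önünde bulundur.
- Karmaşık konuları örneklerle açıkla.
"""

# Optional: the base model already ships a chat template. Overriding it with
# this simplified Llama 3 template drops anything the built-in one handles
# beyond plain chat, such as tool calls.
TEMPLATE """<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

# --- Shell commands (run in a terminal; not part of the Modelfile) ---
# Build + serve
# ollama create yildiz -f TurkceAsistan.Modelfile
# ollama run yildiz

# API (OpenAI-compatible). The prompt asks for the population of Istanbul.
curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "yildiz",
      "messages": [{"role": "user", "content": "İstanbul nüfusu?"}]
    }'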

Ollama Modelfile + serving
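The same request can be issued from Python. The following is a minimal sketch using only the standard library, assuming the `yildiz` model has been created and Ollama is listening on its default port 11434. The helper names `build_chat_request` and `chat` are illustrative and not part of any Ollama client library.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_OPENAI_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(model: str, user_text: str) -> dict:
    """Build an OpenAI-style chat completion payload (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }


def chat(model: str, user_text: str, url: str = OLLAMA_OPENAI_URL) -> str:
    """POST the payload and return the assistant's reply text."""
    body = json.dumps(build_chat_request(model, user_text)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        data = json.load(resp)
    # OpenAI-compatible responses carry the text under choices[0].message.content
    return data["choices"][0]["message"]["content"]


# Example call (requires a running Ollama server with the model created):
# print(chat("yildiz", "İstanbul nüfusu?"))
```

Because the endpoint follows the OpenAI chat completions shape, an existing OpenAI-style client can usually be pointed at the same base URL instead of hand-rolling the request.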

1. RTX 4090 + Q4_K_M Llama 8B (Ollama)

Workload	Result
Batch=1 generation	95 tok/s
Batch=4 parallel	240 tok/s
Streaming first token (TTFT)	180 ms
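To read these figures as user-facing latency, a rough back-of-the-envelope sketch follows. It assumes the batch=4 number is aggregate throughput across the four streams and that decode speed stays constant over the whole response. Neither is stated in the measurements, and the function name is illustrative.

```python
# Figures copied from the table above.
SINGLE_STREAM_TOK_S = 95.0
BATCH4_TOK_S = 240.0
TTFT_S = 0.180


def response_time_s(n_tokens: int, tok_s: float, ttft_s: float = TTFT_S) -> float:
    """Estimate wall-clock time: time to first token plus steady-state decoding."""
    return ttft_s + n_tokens / tok_s


# Assumption: 240 tok/s is the total across 4 parallel requests.
per_stream_tok_s = BATCH4_TOK_S / 4
batch_speedup = BATCH4_TOK_S / SINGLE_STREAM_TOK_S

print(f"per-stream rate at batch=4: {per_stream_tok_s:.0f} tok/s")
print(f"aggregate speedup vs batch=1: {batch_speedup:.2f}x")
print(f"500-token reply, single user: {response_time_s(500, SINGLE_STREAM_TOK_S):.2f} s")
```

Under these assumptions each of the four parallel users sees about 60 tok/s, and a 500-token reply for a single user takes roughly 5.4 seconds.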

Comparison: vLLM with AWQ int4 reaches 175 tok/s at batch=1, so Ollama is slower. But Ollama's zero-setup experience and the Modelfile's system prompt versioning make it the better choice for some use cases.
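One way to make Modelfile-based system prompt versioning concrete is to derive the model tag from the file's contents, so every prompt change yields a new, traceable model name. The sketch below is an assumed workflow rather than a built-in Ollama feature: the helper names and the 12-character hash length are illustrative. It relies only on `ollama create NAME -f FILE` and Ollama's `name:tag` naming.

```python
import hashlib
import subprocess
from pathlib import Path


def modelfile_tag(modelfile_text: str, length: int = 12) -> str:
    """Short content hash of the Modelfile, used as a version tag (hypothetical scheme)."""
    return hashlib.sha256(modelfile_text.encode("utf-8")).hexdigest()[:length]


def create_command(name: str, modelfile: Path) -> list[str]:
    """Build the `ollama create` command for a content-addressed tag."""
    tag = modelfile_tag(modelfile.read_text(encoding="utf-8"))
    return ["ollama", "create", f"{name}:{tag}", "-f", str(modelfile)]


def create_versioned(name: str, modelfile: Path) -> str:
    """Run `ollama create` and return the full model name that was built."""
    cmd = create_command(name, modelfile)
    subprocess.run(cmd, check=True)  # requires the ollama CLI on PATH
    return cmd[2]


# Example (requires Ollama installed):
# print(create_versioned("yildiz", Path("TurkceAsistan.Modelfile")))
```

Keeping the Modelfile in version control and logging the resulting `name:tag` with each request makes it possible to trace any answer back to the exact system prompt that produced it.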

The cookbook's rule of thumb:

High-throughput API → vLLM / SGLang / TGI
Single-user local chat → Ollama (UX + Modelfile)
Apple Silicon → Ollama or MLX-LM

✅ Deliverable

1) Write your own Turkish Modelfile. 2) Test it through the Ollama API. 3) Next lesson: 15.7, MLX-LM on Apple Silicon.
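For step 2, a minimal test sketch against Ollama's native `/api/chat` endpoint is shown below. With `"stream": false`, the response includes `eval_count` and `eval_duration` (in nanoseconds), which is enough to compute your own tok/s and compare it with the benchmark table. The helper names are illustrative, and the server is assumed to be on the default port with the `yildiz` model already created.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # assumption: default local port


def tokens_per_second(response: dict) -> float:
    """Decode speed from Ollama's timing fields (eval_duration is in nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def ask(model: str, user_text: str, url: str = OLLAMA_CHAT_URL) -> dict:
    """Send one non-streaming chat request and return the parsed JSON response."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)


# Example (requires a running server):
# reply = ask("yildiz", "İstanbul nüfusu?")
# print(reply["message"]["content"])
# print(f"{tokens_per_second(reply):.1f} tok/s")
```

Measured numbers will vary with GPU, context length, and whether the model was already loaded, so run the request a few times before comparing.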
