Quantization (GPTQ/AWQ/GGUF) + Final Capstone: Turkish ChatGPT Clone in Production
Module 16 capstone (the curriculum's final capstone): GPTQ, AWQ, and GGUF quantization formats. Turkish Llama-3-8B-Instruct quantized + served with vLLM + Next.js frontend = a Turkish ChatGPT clone. Production deploy at sukruyusufkaya.com/ai-asistan. Curriculum synthesis, real-world artifact.
Şükrü Yusuf KAYA
90 min read
Advanced · 🎓 Final Capstone — the synthesis of the curriculum
Across 16 modules: math, NumPy, PyTorch, the transformer architecture (tokenization → embedding → attention → position → block), training (pre-training + scaling + distributed), fine-tuning (SFT + LoRA + RLHF/DPO), deployment. Now combine all of it: a production-grade Turkish ChatGPT clone. Quantized Turkish Llama-3-Instruct + vLLM backend + Next.js frontend = a live system at sukruyusufkaya.com/ai-asistan. 90 minutes from now you will have the fifth and final artifact of the curriculum: a system that actually RUNS in the real world.
Final Capstone Flow (10 Stages)#
- System architecture — full-stack overview
- Quantization options — GPTQ vs AWQ vs GGUF
- AWQ quantization — Turkish Llama-3-8B → 4-bit
- vLLM backend — quantized model serving
- Next.js frontend — chat UI + streaming
- API integration — frontend ↔ vLLM
- Authentication + rate limiting
- Production deploy — Vercel frontend + GPU cloud backend
- Monitoring + analytics
- Final review — the curriculum's 5 capstones + what you gained
2-3. Quantization Formats#
2.1 Why quantization#
Llama-3-8B in BF16 is 16 GB (8B parameters × 2 bytes); at 4 bits per weight it drops to ~4 GB, 4x smaller. That buys both memory headroom and speed, at the cost of a minor quality loss.
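The arithmetic behind those numbers, as a quick sketch (weights only; the 8B parameter count is rounded):

```python
# Back-of-envelope weight memory for an ~8B-parameter model at different precisions
n_params = 8e9  # ~8 billion parameters (rounded)

for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = n_params * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{name}: ~{gb:.0f} GB")
# BF16: ~16 GB, INT8: ~8 GB, INT4: ~4 GB (KV cache and activations come on top)
```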
2.2 GPTQ (Frantar 2022)#
Builds on 'Optimal Brain Quantization': weights are quantized layer by layer, a few columns at a time, and the remaining weights are updated with approximate second-order (Hessian) information to compensate for the quantization error. Calibration data is required.
pip install auto-gptq
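A minimal GPTQ sketch with auto-gptq, assuming the same Turkish base model used in the AWQ example below; the calibration text and output directory are placeholders:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_path = "sukruyusufkaya/llama-3-8b-tr-instruct"  # same base model as the AWQ example
quant_path = "llama-3-8b-tr-instruct-gptq"            # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibration examples: a handful of tokenized Turkish samples
examples = [tokenizer("Türkçe örnek metin...")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config)

model.quantize(examples)           # runs the layer-wise GPTQ algorithm
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```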
2.3 AWQ (Lin 2023)#
'Activation-aware Weight Quantization'. Salient weight channels (those multiplying high-magnitude activations) are protected via per-channel scaling before quantization, rather than being kept in higher precision.
Typically better quality than GPTQ in benchmarks.
pip install autoawq
2.4 GGUF (llama.cpp)#
Georgi Gerganov's format. CPU + GPU inference. llama.cpp ecosystem.
Lowest hardware requirements — runs on consumer hardware.
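A minimal sketch of running a GGUF build of the model with llama-cpp-python (`pip install llama-cpp-python`); the .gguf filename is an assumption, produced beforehand with llama.cpp's conversion tools:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-8b-tr-instruct.Q4_K_M.gguf",  # placeholder GGUF file
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to GPU if present; 0 = pure CPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Merhaba, kendini kısaca tanıtır mısın?"}],
    max_tokens=200,
)
print(out["choices"][0]["message"]["content"])
```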
2.5 Quantization quality comparison#
Llama-3-8B perplexity (Turkish Wikipedia):
- BF16: 8.2
- AWQ 4-bit: 8.5 (+4%)
- GPTQ 4-bit: 8.7 (+6%)
- GGUF Q4_K_M: 8.4 (+2.5%)
AWQ and GGUF give the best 4-bit quality; GPTQ is marginally worse.
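For reference, one way such perplexity numbers can be measured (a sketch, not the exact evaluation used here): average token negative log-likelihood over held-out Turkish text, exponentiated. The model id below is the base checkpoint; swap in each quantized variant to compare. For a full corpus you would also use a sliding window over long documents.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sukruyusufkaya/llama-3-8b-tr-instruct"  # swap in each quantized variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

text = "..."  # held-out Turkish Wikipedia passage (placeholder)
enc = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # labels=input_ids makes the model return the mean token NLL as .loss
    loss = model(**enc, labels=enc["input_ids"]).loss

print("perplexity:", math.exp(loss.item()))
```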
2.6 AWQ quantization of Turkish Llama-3#
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "sukruyusufkaya/llama-3-8b-tr-instruct"
quant_path = "sukruyusufkaya/llama-3-8b-tr-instruct-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibration data: Turkish samples
calibration_data = ["Türkçe örnek metin 1...", "Türkçe örnek metin 2...", ...]

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

model.quantize(tokenizer, quant_config=quant_config, calib_data=calibration_data)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
Result: a ~4 GB model, ~4% higher perplexity, ~50% faster inference.
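A quick smoke test of the quantized checkpoint before serving it, as a sketch (the prompt is arbitrary; assumes a CUDA GPU is available):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "sukruyusufkaya/llama-3-8b-tr-instruct-awq"
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

prompt = "Türkiye'nin başkenti neresidir?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```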
5-8. Full Stack#
5.1 vLLM backend#
# Serve the quantized model
python -m vllm.entrypoints.openai.api_server \
    --model sukruyusufkaya/llama-3-8b-tr-instruct-awq \
    --quantization awq \
    --gpu-memory-utilization 0.9 \
    --port 8000
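Because vLLM exposes an OpenAI-compatible API, the server can be smoke-tested with the openai Python client before wiring up the frontend. The local base_url and dummy api_key below are assumptions for a single-GPU setup:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

stream = client.chat.completions.create(
    model="sukruyusufkaya/llama-3-8b-tr-instruct-awq",
    messages=[{"role": "user", "content": "Merhaba! Kendini kısaca tanıtır mısın?"}],
    temperature=0.7,
    max_tokens=200,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```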
5.2 Next.js frontend#
// app/api/chat/route.ts
export async function POST(req: Request) {
  const { messages } = await req.json();

  const response = await fetch('http://vllm-backend:8000/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'sukruyusufkaya/llama-3-8b-tr-instruct-awq',
      messages,
      stream: true,
      temperature: 0.7,
      max_tokens: 500,
    }),
  });

  // Stream response back to client
  return new Response(response.body, {
    headers: { 'Content-Type': 'text/event-stream' },
  });
}
5.3 Chat UI#
// components/Chat.tsx
'use client';

import { useState } from 'react';

type Message = { role: 'user' | 'assistant'; content: string };

export default function Chat() {
  const [messages, setMessages] = useState<Message[]>([]);
  const [input, setInput] = useState('');

  const send = async () => {
    const newMessages: Message[] = [...messages, { role: 'user', content: input }];
    setMessages(newMessages);
    setInput('');

    const res = await fetch('/api/chat', {
      method: 'POST',
      body: JSON.stringify({ messages: newMessages }),
    });

    // Stream handling: each chunk is an OpenAI-style SSE line ("data: {...}");
    // in production, parse out the delta text instead of appending the raw bytes.
    const reader = res.body!.getReader();
    let assistant = '';
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      assistant += new TextDecoder().decode(value);
      setMessages([...newMessages, { role: 'assistant', content: assistant }]);
    }
  };

  return (
    <div>
      {messages.map((m, i) => (
        <p key={i}><b>{m.role}:</b> {m.content}</p>
      ))}
      <input value={input} onChange={e => setInput(e.target.value)} />
      <button onClick={send}>Gönder</button>
    </div>
  );
}
5.4 Auth + rate limiting#
- NextAuth for user authentication
- Rate limit per user per minute (e.g., 30 messages)
- Optional: Stripe billing tiers
5.5 Production deploy#
- Frontend: Vercel (Next.js)
- Backend: GPU cloud (Runpod, Vast.ai, AWS EC2 G5)
- Domain: ai-asistan.sukruyusufkaya.com
🎉🎉🎉 CURRICULUM COMPLETE — Synthesis of the 5 Capstones 🎉🎉🎉
Across 16 modules: math foundation (1-5) → transformer architecture (6-10) → training & scaling (11-13) → fine-tuning & alignment (14-15) → deployment (16). 5 Capstone Artifacts: TurkTokenizer-tr (Module 6), Turkish Semantic Search (Module 7), Mini Llama-3 Pre-train (Module 11), Turkish Llama-3-Instruct Fine-Tune (Module 14), Turkish ChatGPT Clone (Module 16). Complete LLM engineering, from zero to production. Total: 87 lessons, ~95 hours of ultra-detailed content. Module 16 inventory: 2 lessons, 165 min. Congratulations — you have completed Turkey's most comprehensive LLM engineering curriculum.
🏆 Curriculum Complete — Grand Total#
All Modules#
| Module | Topic | Lessons | Duration |
|---|---|---|---|
| 0 | Course Framework | 5 | 350 min |
| 1 | Mathematical Arsenal | 10 | 550 min |
| 2 | NumPy + Autograd | 6 | 360 min |
| 3 | Philosophical History | 5 | 280 min |
| 4 | The LLM Mental Model | 8 | 470 min |
| 5 | PyTorch Engineering | 8 | 510 min |
| 6 | Tokenization | 10 | 660 min |
| 7 | Embedding | 6 | 415 min |
| 8 | Attention | 5 | 370 min |
| 9 | Position Encoding | 5 | 335 min |
| 10 | Transformer Block | 3 | 215 min |
| 11 | Pre-training | 3 | 230 min |
| 12 | Scaling Laws | 3 | 200 min |
| 13 | Distributed Training | 3 | 225 min |
| 14 | Fine-tuning (SFT/LoRA/QLoRA) | 3 | 235 min |
| 15 | RLHF + DPO | 2 | 145 min |
| 16 | Production Deployment | 2 | 165 min |
| TOTAL | 17 modules | 87 lessons | ~5715 min (~95 h) |
5 Capstone Artifacts#
- TurkTokenizer-tr 32K BPE (Module 6)
- Turkish Semantic Search Mini-RAG (Module 7)
- Mini Llama-3 100M-Parameter Turkish Pre-train (Module 11)
- Turkish Llama-3-8B-Instruct Fine-Tune (Module 14)
- Turkish ChatGPT Clone in Production (Module 16)
Turkey's Most Comprehensive LLM Engineering Curriculum#
The long journey is complete. Now go build.
Frequently Asked Questions
Where to go next: reasoning models (o1, DeepSeek-R1), Mixture-of-Experts, multi-modal (vision + audio), agents + tool use, evaluation frameworks (HELM, MMLU), AI safety + alignment. The curriculum is the foundation — now choose a specialization.