İçeriğe geç

Quantization (GPTQ/AWQ/GGUF) + Final Capstone: Türkçe ChatGPT Klonu Production'da

Modül 16 capstone (müfredatın final capstone'u): GPTQ, AWQ, GGUF quantization formats. Türkçe Llama-3-8B-Instruct quantize + vLLM serve + Next.js frontend = Türkçe ChatGPT klonu. sukruyusufkaya.com/ai-asistan production deploy. Müfredatın sentezi, gerçek dünya artefakt.

Şükrü Yusuf KAYA
90 dakikalık okuma
İleri
Quantization (GPTQ/AWQ/GGUF) + Final Capstone: Türkçe ChatGPT Klonu Production'da
🎓 Final Capstone — müfredatın sentezi
16 modül boyunca: math, NumPy, PyTorch, transformer mimarisi (tokenization → embedding → attention → position → block), training (pre-train + scaling + distributed), fine-tune (SFT + LoRA + RLHF/DPO), deployment. Şimdi hepsini birleştir: production-grade Türkçe ChatGPT klonu. Quantized Türkçe Llama-3-Instruct + vLLM backend + Next.js frontend = sukruyusufkaya.com/ai-asistan live system. 90 dakika sonra: müfredatın beşinci ve final artefaktına sahipsin. Gerçek dünyada ÇALIŞAN sistem.

Final Capstone Akışı (10 Aşama)#

  1. Sistem mimarisi — full-stack overview
  2. Quantization options — GPTQ vs AWQ vs GGUF
  3. AWQ quantize — Türkçe Llama-3-8B → 4-bit
  4. vLLM backend — quantized model serving
  5. Next.js frontend — chat UI + streaming
  6. API integration — frontend ↔ vLLM
  7. Authentication + rate limiting
  8. Production deploy — Vercel frontend + GPU cloud backend
  9. Monitoring + analytics
  10. Final review — müfredatın 5 capstone'u + ne kazandın

2-3. Quantization Formats#

2.1 Quantization niye#

Llama-3-8B BF16: 16 GB. 4-bit: 4 GB — 4x küçük. Memory + speed avantaj. Quality minor loss.

2.2 GPTQ (Frantar 2022)#

'Optimal Brain Quantization'. Iteratif gradient-based weight quantization. Calibration data needed.
pip install auto-gptq

2.3 AWQ (Lin 2023)#

'Activation-aware Weight Quantization'. Salient weights (high activation magnitude) preserved. Better quality than GPTQ in benchmarks.
pip install autoawq

2.4 GGUF (llama.cpp)#

Georgi Gerganov's format. CPU + GPU inference. llama.cpp ecosystem. Lowest hardware requirements — runs on consumer hardware.

2.5 Quantization quality comparison#

Llama-3-8B perplexity (Türkçe Wiki):
  • BF16: 8.2
  • AWQ 4-bit: 8.5 (+%4)
  • GPTQ 4-bit: 8.7 (+%6)
  • GGUF Q4_K_M: 8.4 (+%2.5)
AWQ ve GGUF best 4-bit quality. GPTQ marginal worse.

2.6 AWQ Türkçe Llama-3 quantize#

from awq import AutoAWQForCausalLM from transformers import AutoTokenizer model_path = "sukruyusufkaya/llama-3-8b-tr-instruct" quant_path = "sukruyusufkaya/llama-3-8b-tr-instruct-awq" model = AutoAWQForCausalLM.from_pretrained(model_path) tokenizer = AutoTokenizer.from_pretrained(model_path) # Calibration data: Türkçe samples calibration_data = ["Türkçe örnek metin 1...", "Türkçe örnek metin 2...", ...] quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM", } model.quantize(tokenizer, quant_config=quant_config, calib_data=calibration_data) model.save_quantized(quant_path) tokenizer.save_pretrained(quant_path)
Result: ~4 GB model, %4 perplexity loss, %50 faster inference.

5-8. Full Stack#

5.1 vLLM backend#

# Quantized model serve python -m vllm.entrypoints.openai.api_server \ --model sukruyusufkaya/llama-3-8b-tr-instruct-awq \ --quantization awq \ --gpu-memory-utilization 0.9 \ --port 8000

5.2 Next.js frontend#

// app/api/chat/route.ts export async function POST(req: Request) { const { messages } = await req.json(); const response = await fetch('http://vllm-backend:8000/v1/chat/completions', { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ model: 'sukruyusufkaya/llama-3-8b-tr-instruct-awq', messages, stream: true, temperature: 0.7, max_tokens: 500, }), }); // Stream response back to client return new Response(response.body, { headers: { 'Content-Type': 'text/event-stream' }, }); }

5.3 Chat UI#

// components/Chat.tsx import { useState } from 'react'; export default function Chat() { const [messages, setMessages] = useState([]); const [input, setInput] = useState(''); const send = async () => { const newMessages = [...messages, { role: 'user', content: input }]; setMessages(newMessages); setInput(''); const res = await fetch('/api/chat', { method: 'POST', body: JSON.stringify({ messages: newMessages }), }); // Stream handling... const reader = res.body!.getReader(); let assistant = ''; while (true) { const { done, value } = await reader.read(); if (done) break; assistant += new TextDecoder().decode(value); setMessages([...newMessages, { role: 'assistant', content: assistant }]); } }; return ( <div> {messages.map((m, i) => <p key={i}><b>{m.role}:</b> {m.content}</p>)} <input value={input} onChange={e => setInput(e.target.value)} /> <button onClick={send}>Gönder</button> </div> ); }

5.4 Auth + rate limiting#

  • NextAuth for user authentication
  • Rate limit per user per minute (e.g., 30 messages)
  • Optional: Stripe billing tiers

5.5 Production deploy#

  • Frontend: Vercel (Next.js)
  • Backend: GPU cloud (Runpod, Vast.ai, AWS EC2 G5)
  • Domain: ai-asistan.sukruyusufkaya.com
🎉🎉🎉 MÜFREDAT TAMAMLANDI — 5 Capstone Sentezi 🎉🎉🎉
16 modül boyunca: math foundation (1-5) → transformer arch (6-10) → training scaling (11-13) → fine-tune alignment (14-15) → deployment (16). 5 Capstone Artifact: TurkTokenizer-tr (Modül 6), Türkçe Semantic Search (Modül 7), Mini Llama-3 Pre-train (Modül 11), Türkçe Llama-3-Instruct Fine-Tune (Modül 14), Türkçe ChatGPT Klonu (Modül 16). Sıfırdan production'a tam LLM mühendisliği. Toplam: 87 ders, ~85 saat ultra-detaylı içerik. Modül 16 envanteri: 2 ders, 165 dk. Tebrikler — Türkiye'nin en kapsamlı LLM mühendisliği müfredatını tamamladın.

🏆 Müfredat Tamamlandı — Genel Toplam#

Tüm Modüller#

ModülKonuDersSüre
0Kurs Çerçevesi5350 dk
1Matematiksel Cephane10550 dk
2NumPy + Autograd6360 dk
3Felsefi Tarih5280 dk
4LLM Zihinsel Model8470 dk
5PyTorch Mühendislik8510 dk
6Tokenization10660 dk
7Embedding6415 dk
8Attention5370 dk
9Position Encoding5335 dk
10Transformer Block3215 dk
11Pre-training3230 dk
12Scaling Laws3200 dk
13Distributed Training3225 dk
14Fine-tuning (SFT/LoRA/QLoRA)3235 dk
15RLHF + DPO2145 dk
16Production Deployment2165 dk
TOPLAM17 modül87 ders~5715 dk (~95 saat)

5 Capstone Artifact#

  1. TurkTokenizer-tr 32K BPE (Modül 6)
  2. Türkçe Semantic Search Mini-RAG (Modül 7)
  3. Mini Llama-3 100M Param Türkçe Pretrain (Modül 11)
  4. Türkçe Llama-3-8B-Instruct Fine-Tune (Modül 14)
  5. Türkçe ChatGPT Klonu Production (Modül 16)

Türkiye'nin En Kapsamlı LLM Mühendisliği Müfredatı#

Uzun yolculuk tamamlandı. Şimdi inşa et.

Sık Sorulan Sorular

Reasoning models (o1, DeepSeek-R1), Mixture-of-Experts, multi-modal (vision + audio), agents + tool use, evaluation frameworks (HELM, MMLU), AI safety + alignment. Müfredat foundation — şimdi specialization seçimi.

Yorumlar & Soru-Cevap

(0)
Yorum yazmak için giriş yap.
Yorumlar yükleniyor...

İlgili İçerikler