
Production Engineering: Reproducibility, CI/CD for ML, Versioning, and Deployment Patterns

Final lesson of PyTorch engineering — production workflow patterns: ML CI/CD pipelines, eval harness CI integration, model + prompt + data versioning (DVC, MLflow, HF Hub), canary deployment, A/B testing, rollback strategies, drift monitoring, KVKK-compliant deployment. Closing of Part I.

Şükrü Yusuf KAYA
60 min read
Intermediate
Production Engineering: Reproducibility, CI/CD for ML, Versioning, and Deployment Patterns
🚀 The last bridge to production
The final lesson of Module 5 and the close of Part I. So far we have covered the mathematics (Module 1), autograd (Module 2), history (Module 3), the mental model (Module 4), and PyTorch engineering (5.1-5.7). The LLM engineer's arsenal is complete. In this lesson we assemble all the pieces: the production workflow. Sixty minutes from now you will be able to write an ML CI/CD pipeline, apply model versioning and canary deploys, and plan a KVKK-compliant deployment.

Lesson Map

  1. How ML CI/CD differs from classical SDE CI/CD
  2. Reproducibility fundamentals: pinned dependencies, lock files
  3. Code + data + model versioning
  4. MLflow experiment tracking
  5. DVC data versioning
  6. HuggingFace Hub model registry
  7. Integrating the eval harness into CI
  8. Canary deployment strategies
  9. A/B testing for ML
  10. Drift monitoring + auto-rollback
  11. KVKK-compliant deployment
  12. Closing Part I + transition to Part II

1. How ML CI/CD Differs from Classical SDE CI/CD

Classical software: code → test → deploy. Linear.
ML: three dimensions (code, data, model), each versioned independently.

Classical SDE CI/CD

Code → Lint → Unit test → Build → Deploy
A single artifact (binary, container).

ML CI/CD

Code change → ?
Data change → ?
Model change → ?
Prompt change → ?
Four separate triggers. How much testing does each kind of change require?

Trigger matrix

| Change | Required tests |
| --- | --- |
| Code change | Lint + unit + integration + eval (light) |
| Data change | Eval (full) + data validation |
| Model change | Full eval suite + canary |
| Prompt change | Eval (specific) + small A/B |
| Config change | Eval (full) + canary |

Modern recommendation

A single pipeline fires on every trigger, but with adaptive depth:
  • Small change → fast tests (5 min)
  • Big change → full eval (1-3 hours)
GitHub Actions, GitLab CI, and Buildkite all support conditional jobs.
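
The adaptive-depth idea can be sketched as a small routing function (an illustrative sketch; the path patterns and test-depth labels are assumptions, not from any specific CI system):

```python
# Map changed file paths to the required test depth (sketch; the path
# conventions and depth labels below are assumptions, adapt to your repo).
def required_tests(changed_paths):
    depth = {"lint", "unit"}  # cheap checks always run
    for path in changed_paths:
        if path.startswith("data/"):
            depth |= {"data_validation", "eval_full"}
        elif path.startswith("models/") or path.endswith(".ckpt"):
            depth |= {"eval_full", "canary"}
        elif path.startswith("prompts/"):
            depth |= {"eval_specific", "ab_small"}
        elif path.startswith("configs/"):
            depth |= {"eval_full", "canary"}
        elif path.startswith("src/"):
            depth |= {"integration", "eval_light"}
    return depth
```

A CI job can then enable or skip stages based on the returned set, mirroring the trigger matrix above.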

2. Reproducibility Fundamentals

You should be able to rerun an LLM training run identically six months later. Four layers:

Layer 1: Code

  • Save the Git commit hash together with every artifact
  • Branch policy: main locked, PRs required
  • Tag releases: checkpoint each release with a tag like `v1.2.3`

Layer 2: Dependencies

  • Pinned versions: exact versions in `requirements.txt`
  • Lock files: `poetry.lock`, `uv.lock`, `requirements-lock.txt`
  • Container: pin everything in the Dockerfile (CUDA version, PyTorch, OS)

```
# requirements-lock.txt
torch==2.5.1+cu124
transformers==4.46.2
deepspeed==0.15.0
```

Layer 3: Data

  • Dataset version: HuggingFace dataset hash
  • Snapshot date: data fetched on YYYY-MM-DD
  • DVC tracking: git-like versioning for big data (below)

Layer 4: Hardware/Random

  • Seeds set: numpy, torch, and CUDA, all of them
  • Deterministic mode: optional but available
  • Hardware spec: documented (GPU model, driver, CUDA version)
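
In PyTorch, this layer looks roughly like the following (a minimal sketch; note that `torch.use_deterministic_algorithms(True)` slows some kernels and raises on ops that have no deterministic implementation):

```python
import os
import random

import numpy as np
import torch

def set_reproducibility(seed: int = 42, deterministic: bool = False):
    # Seed every RNG the training loop touches
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds all CUDA devices
    if deterministic:
        # Optional: force deterministic kernels (slower; some ops unsupported)
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
        torch.use_deterministic_algorithms(True)
        torch.backends.cudnn.benchmark = False

set_reproducibility(42)
```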

Production reproducibility report

At the end of every training run:

```
{
  "run_id": "uuid",
  "timestamp": "2026-05-12T10:00:00Z",
  "git_commit": "abc123def",
  "git_branch": "main",
  "dataset_hash": "sha256:...",
  "config_hash": "sha256:...",
  "hardware": "8x H100 80GB",
  "torch_version": "2.5.1+cu124",
  "deepspeed_version": "0.15.0",
  "final_loss": 1.85,
  "checkpoint_uri": "s3://...",
  "eval_results": {...}
}
```

With this JSON, an identical rerun is possible six months later.

3. Code + Data + Model Versioning

Code: Git (standard)

```
src/
  train.py
  eval.py
configs/
  llama_3_8b.yaml
  llama_3_70b.yaml
.gitignore   # data, checkpoints
```

Data: DVC or HuggingFace Datasets

DVC (Data Version Control):

```bash
dvc init
dvc remote add origin s3://bucket/dvc-store
dvc add data/turkish_corpus_v1.parquet
git add data/turkish_corpus_v1.parquet.dvc
git commit -m "Add Turkish corpus v1"
dvc push   # Upload to S3
```

Git tracks the small `.dvc` files (hash + metadata); the actual data lives in S3. Diff and revert become possible even for large datasets.
HuggingFace Datasets: a dataset can be pushed to the Hub, version-controlled:

```python
from datasets import Dataset

ds = Dataset.from_parquet("...")
ds.push_to_hub("sukruyusufkaya/turkce-egitim-v1")
```

Model: HuggingFace Hub + MLflow

HF Hub:

```python
model.push_to_hub("sukruyusufkaya/llama3-tr-v1.0")
```

Versioned, git-like, public/private.
MLflow experiment tracking:

```python
import mlflow

with mlflow.start_run():
    mlflow.log_param("lr", 3e-4)
    mlflow.log_metric("loss", 1.85)
    mlflow.pytorch.log_model(model, "model")
```

All experiments land in a searchable database.

Prompt versioning

Version prompts like code:

```yaml
# prompts/customer_support_v3.yaml
version: 3
system_prompt: |
  Sen Trendyol müşteri destek asistanısın...
created: 2026-04-15
author: ali.veli
changelog: |
  v3: improved handling of Turkish slang
```

Track them with Git. In production, include a prompt version field in every log line.
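
Alongside the explicit version field, a content hash makes silent prompt edits visible in logs (a small sketch; the 12-character truncation is an arbitrary choice):

```python
import hashlib

def prompt_fingerprint(prompt_text: str) -> str:
    # Content-addressed tag: identical prompts always hash the same,
    # so any silent edit changes the fingerprint in production logs.
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]
```

Log both `version: 3` and the fingerprint; the former is human-readable, the latter catches edits that forgot to bump the version.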

4. MLflow: Experiment Tracking

MLflow is the industry standard for experiment tracking.

Setup

```bash
pip install mlflow
mlflow server --host 0.0.0.0 --port 5000   # production deployment
```

Usage

```python
import json

import mlflow

mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("llama3-tr-finetune")

with mlflow.start_run(run_name="v1.0-baseline"):
    # Params
    mlflow.log_params({
        "model": "Llama-3.1-8B",
        "lr": 3e-4,
        "batch_size": 32,
        "epochs": 3,
    })

    # Metrics (each step)
    for epoch in range(3):
        for step, batch in enumerate(train_loader):
            loss = train_step(batch)
            mlflow.log_metric("loss", loss, step=step + epoch * len(train_loader))

    # Artifact: model
    mlflow.pytorch.log_model(
        model,
        artifact_path="model",
        registered_model_name="llama3-tr",
    )

    # Custom artifact: eval results
    with open("eval_results.json", "w") as f:
        json.dump(eval_results, f)
    mlflow.log_artifact("eval_results.json")
```

Features

  • UI: compare experiments, plot metrics, browse model artifacts
  • Model Registry: model lifecycle (Staging, Production, Archived)
  • API: programmatic access
  • Deployment integration: `mlflow.pytorch.load_model()` straight into prod

Turkish companies

MLflow is in widespread use at Trendyol, Hepsiburada, and Yapı Kredi. In production it runs either self-hosted or managed (Databricks).

5. DVC: Data Version Control

You cannot track large data (GB-TB) with Git. DVC is the solution.

Workflow

```bash
# Init
git init
dvc init

# Add remote
dvc remote add -d myremote s3://mybucket/dvc-store

# Add data
dvc add data/turkish_corpus_50gb.parquet
# DVC creates: data/turkish_corpus_50gb.parquet.dvc (small, git-trackable)
# Actual file in .dvc/cache/
git add data/turkish_corpus_50gb.parquet.dvc .gitignore
git commit -m "Add corpus v1"
dvc push   # Upload to S3

# Update data
dvc add data/turkish_corpus_50gb.parquet   # detect changes
git commit -am "Update corpus to v2"
dvc push

# Time travel
git checkout v1-tag
dvc pull   # download v1 data
```

Pipelines

```yaml
# dvc.yaml
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - data/raw.parquet
    outs:
      - data/processed.parquet
  train:
    cmd: python train.py
    deps:
      - data/processed.parquet
      - src/train.py
    outs:
      - models/checkpoint.pt
```

`dvc repro` is incremental: only the stages that changed are rerun.
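
The idea behind `dvc repro`'s incrementality can be illustrated with a toy dependency-hash cache (a sketch of the concept, not DVC's actual mechanism; the file and stage names are made up):

```python
import hashlib
import json
import pathlib

STATE_FILE = pathlib.Path("stage_state.json")

def deps_hash(paths):
    # One hash over the contents of all dependency files
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(pathlib.Path(p).read_bytes())
    return h.hexdigest()

def should_rerun(stage, dep_paths):
    # Rerun only when the hash of the stage's dependencies changed
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    current = deps_hash(dep_paths)
    if state.get(stage) == current:
        return False
    state[stage] = current
    STATE_FILE.write_text(json.dumps(state))
    return True
```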

Benefits

  • Git-like data: branch, diff, merge
  • Lazy fetch: download only the data you need
  • Pipeline caching: incremental updates
  • Reproducibility: code and data history stay aligned

Turkish perspective

Corpora of 50GB+ are common in Turkish NLP projects. DVC + S3 (or a local MinIO) is the standard pattern.

6. HuggingFace Hub: Model Registry

Model artifact storage plus version control.

Push

```python
from transformers import AutoModelForCausalLM

# Push the fine-tuned model to the Hub
model.push_to_hub(
    "sukruyusufkaya/llama3-tr-customer-support-v1",
    use_auth_token=True,
    commit_message="Initial release: v1.0 fine-tuned on customer support data",
)
```

Versioning

HF Hub is git-like: every commit hash is recorded. Pull a specific version:

```python
model = AutoModelForCausalLM.from_pretrained(
    "sukruyusufkaya/llama3-tr-customer-support-v1",
    revision="abc123def",  # specific git hash
)
```

Model card

A README.md (model card) for every model:

```markdown
---
license: apache-2.0
language:
  - tr
base_model: meta-llama/Llama-3.1-8B
tags:
  - turkish
  - customer-support
---

# Llama 3 Turkish Customer Support v1.0

## Training data
50K customer-support conversations, in Turkish.

## Eval metrics
- TR-MMLU: 62%
- Customer-support accuracy: 87%

## Limitations
- ...
```

Documentation required for KVKK and the EU AI Act.

Private models

```python
model.push_to_hub(repo_id, private=True)  # team-only
```

Comparison with MLflow

  • MLflow: experiment tracking + on-prem registry
  • HF Hub: artifact-focused, public sharing, integrated with transformers/datasets
The modern stack uses MLflow for training and HF Hub for sharing. You end up with both.

7. Integrating the Eval Harness into CI

The heart of ML CI/CD: the eval harness.

Eval suite design

```python
# tests/eval/test_model.py
import pytest

from my_model import MyModel
from eval_harness import EvalSuite

@pytest.mark.eval
def test_model_accuracy():
    model = MyModel.from_pretrained("path/to/checkpoint")
    suite = EvalSuite(
        benchmarks=["mmlu", "gsm8k", "humaneval"],
        languages=["en", "tr"],
    )
    results = suite.run(model)

    # Thresholds (regression prevention)
    assert results["mmlu"] > 0.60, f"MMLU regression: {results['mmlu']}"
    assert results["gsm8k"] > 0.45, f"GSM8K regression: {results['gsm8k']}"
```

CI integration

```yaml
# .github/workflows/ml-ci.yml
name: ML CI
on:
  pull_request:
    paths:
      - 'src/**'
      - 'configs/**'
jobs:
  fast-eval:
    runs-on: gpu-runner
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest tests/eval/fast/ -m eval
        # 5-10 min, basic regression check
  full-eval:
    runs-on: gpu-runner
    if: contains(github.event.pull_request.labels.*.name, 'full-eval')
    steps:
      - run: pytest tests/eval/full/ -m eval
        # 1-3 hours, full benchmark suite
```

Fast eval runs automatically on PRs; full eval is triggered manually with a label.

Custom evals for Turkish

```python
# Turkish-specific eval
turkish_benchmarks = ["tr-mmlu", "tr-gsm", "trendyol-customer-support-test"]
suite = EvalSuite(benchmarks=turkish_benchmarks)
```

Modules 53 and 59 (Turkish Eval Workshop) cover this in detail.

Benchmark regression matrix

The CI dashboard shows the trend of every metric:
  • Last 30 PRs
  • Trend up/flat/down
  • Regression alert (>2% drop)
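
The >2% regression alert is straightforward to express in code (a sketch; the metric names are examples):

```python
def find_regressions(baseline, current, rel_threshold=0.02):
    # Flag metrics that dropped by more than rel_threshold
    # relative to their baseline value
    return {
        name: (baseline[name], current[name])
        for name in baseline
        if name in current
        and baseline[name] - current[name] > rel_threshold * baseline[name]
    }

print(find_regressions({"mmlu": 0.62, "gsm8k": 0.47},
                       {"mmlu": 0.60, "gsm8k": 0.47}))
```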

8. Canary Deployment: Gradual Rollout

Roll a new model into production gradually.

Strategy

```
Day 0:   0% new model, 100% old (current state)
Day 1:   5% new,  95% old (canary)
Day 2:  25% new,  75% old (if metrics OK)
Day 3:  50% new,  50% old
Day 4: 100% new (full rollout)
```

Implementation patterns

Pattern 1: Feature flag

```python
def get_model(user_id):
    if feature_flag.is_enabled("new_model_canary", user_id):
        return new_model
    return old_model
```

Tools like LaunchDarkly and GrowthBook.

Pattern 2: Load balancer routing

```yaml
# kubernetes (illustrative weighted routing)
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - backend:
              service: new-model-service
            weight: 5    # 5%
          - backend:
              service: old-model-service
            weight: 95
```

Pattern 3: Sticky sessions

The same user always sees the same model. Consistent experience.
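
Sticky assignment is typically implemented by hashing the user ID into a stable bucket, so ramping from 5% to 25% only adds users and never flips existing ones (a sketch; any stable hash works):

```python
import hashlib

def in_canary(user_id: str, percent: float) -> bool:
    # Deterministic bucket in [0, 10000): the same user always lands
    # in the same variant for a given rollout percentage.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return bucket < percent * 100

def get_model(user_id, percent, new_model, old_model):
    return new_model if in_canary(user_id, percent) else old_model
```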

Metrics to monitor

During the canary phase, watch:
  • Error rate: is the new model crashing?
  • Latency p99: is it slower?
  • User satisfaction: is the thumbs-down rate up?
  • Cost: token consumption?
  • Quality: hallucination, accuracy

Auto-rollback

If a threshold is crossed, roll back automatically:

```
if error_rate > 2% AND error_rate_baseline < 0.5%:
    rollback_to_old_model()
    alert_team()
```

A must-have feature of any production-grade ML system.
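
The rollback condition above as a guard function (a sketch; the thresholds and the rollback callback are placeholders):

```python
def check_canary(error_rate, baseline_error_rate,
                 max_error=0.02, max_baseline=0.005,
                 on_rollback=None):
    # Roll back only when the canary is bad AND the baseline proves
    # the problem is the new model, not a platform-wide incident.
    if error_rate > max_error and baseline_error_rate < max_baseline:
        if on_rollback:
            on_rollback()
        return "rollback"
    return "keep"
```

The baseline comparison matters: if both old and new models spike, the cause is likely infrastructure, and rolling back would not help.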

9. A/B Testing for ML

Canary deployment is an incremental rollout; A/B testing is a scientific comparison.

Setup

  • Variant A: old model
  • Variant B: new model
  • Random assignment: 50-50 (or a different split)
  • Metric: a success criterion (accuracy, NPS, conversion)

Statistical considerations

Sample size

```
N = ceil(2 × (Z_α + Z_β)² × σ² / Δ²)
α = 0.05, β = 0.2, σ = 0.1 (std), Δ = 0.02 (effect)
N ≈ 392 per variant
```

In LLM tests, N = 500-5000 per variant is typical.
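
The same computation in code (a sketch; the z-values for α = 0.05 two-sided and β = 0.2 are hardcoded defaults):

```python
def samples_per_variant(sigma, delta, z_alpha=1.96, z_beta=0.84):
    # Two-sample formula: n = 2 (z_alpha + z_beta)^2 * sigma^2 / delta^2
    # per variant; round up in practice.
    return 2 * (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2

print(round(samples_per_variant(sigma=0.1, delta=0.02)))  # 392
```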

P-value

t-test or bootstrap → significance test.
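
A dependency-free alternative is a permutation test (a sketch; the resample count and seed are arbitrary choices):

```python
import random

def permutation_pvalue(a, b, n_perm=2000, seed=0):
    # Two-sided p-value: how often does a random relabeling of the pooled
    # samples produce a mean difference at least as extreme as observed?
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0
```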

Bayesian alternative

Modern A/B testing tools (Statsig, Optimizely) use Bayesian methods → cleaner interpretation.

Practical framework

```python
from ab_testing import Experiment  # illustrative in-house library

exp = Experiment(
    name="prompt-v3-vs-v2",
    variants={"control": prompt_v2, "treatment": prompt_v3},
    metric="user_satisfaction",
    target_n=1000,
)
exp.start()  # Real traffic split
```

Common pitfalls

  1. Peeking: deciding before the target sample size is reached
  2. Multiple testing: many metrics → false positives
  3. Novelty effect: the new thing is interesting on day one, then the lift erodes
  4. Network effects: do the variants interact?

Turkish perspective

Trendyol and Hepsiburada use A/B testing heavily. LLM prompt iterations are validated with A/B tests.

10. Drift Monitoring + Auto-Rollback

A production model degrades over time. Why?

Drift types

1. Data drift

The input distribution changes. Example: in Turkish customer support, the word "GPT" was rare in 2024 and common by 2026. The model is not prepared for GPT-related questions.
Detection: KL divergence of the input embedding distribution against a baseline.
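
A detection sketch: bucket inputs into counts (tokens, intents, embedding clusters) and compare today's window against a baseline with KL divergence (the epsilon smoothing is an assumption to keep the logs finite):

```python
import math

def kl_divergence(p_counts, q_counts, eps=1e-9):
    # KL(P || Q) over the union of observed keys, with epsilon
    # smoothing so unseen keys don't blow up the log.
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) or 1
    q_total = sum(q_counts.values()) or 1
    kl = 0.0
    for k in keys:
        p = p_counts.get(k, 0) / p_total + eps
        q = q_counts.get(k, 0) / q_total + eps
        kl += p * math.log(p / q)
    return kl

# Hypothetical intent counts: "gpt" questions surged vs. the baseline
baseline = {"kargo": 500, "iade": 300, "gpt": 5}
today = {"kargo": 400, "iade": 250, "gpt": 150}
print(kl_divergence(today, baseline))
```

Alert when the divergence exceeds a threshold calibrated on historical windows.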

2. Concept drift

The input-output relationship changes. Example: "a good movie" meant Avengers in 2020, a different genre in 2026.
Detection: model output distribution plus user feedback trends.

3. Performance drift

Accuracy and NPS slowly decline.
Detection: continuous eval on canary traffic.

Monitoring stack

```
Input → [model] → Output
  ↓                  ↓
Drift detector   Performance monitor
  ↓                  ↓
    Alert + retrain trigger
```

Auto-rollback triggers

  • Error rate > 2%: immediate rollback
  • Latency p99 > 5s: rollback
  • Negative user feedback > 30%: rollback within 1 hour
  • Hallucination rate > 5%: rollback (Module 56)

Production tooling

  • Langfuse (Module 48): LLM-specific drift detection
  • Evidently AI: ML model drift
  • Arize: production ML observability
  • Prometheus + custom rules: low-level monitoring

11. KVKK-Compliant Deployment

A production LLM in Turkey must comply with KVKK. Module 57 covers this in depth; here is what is deployment-specific:

Data residency

KVKK: personal data must (in general) stay in Turkey.
Production stack:
  • LLM training: cloud GPU (AWS Frankfurt, GCP Belgium), compliant geography
  • LLM inference: hosted in Turkey (Aselsan, BiSU, on-prem)
  • Customer data: Turkey-resident database

Audit trail

For every LLM call, log:
  • User ID (anonymized is OK)
  • Timestamp
  • Input (PII-stripped or encrypted)
  • Output
  • Model version
  • Cost
  • Latency
KVKK regulators can request an audit.
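
A minimal audit-record builder with naive PII redaction (a sketch; the regexes only catch emails and phone-like digit runs, so a real deployment needs a proper PII pipeline):

```python
import re
from datetime import datetime, timezone

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s]{9,13}\d")

def redact(text: str) -> str:
    # Naive PII stripping: emails first, then phone-like digit runs
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))

def audit_record(user_id, prompt, output, model_version, cost, latency_ms):
    return {
        "user_id": user_id,  # assumed anonymized upstream
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": redact(prompt),
        "output": redact(output),
        "model_version": model_version,
        "cost_usd": cost,
        "latency_ms": latency_ms,
    }
```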

Right to be forgotten

If a user says "delete my data":
  • Is it removable from the training data?
  • If the model memorized it (Module 4.8), this is hard
  • Deleting it from output caches is feasible
  • Conversations with the LLM: explicit consent (T&C)
  • If the data feeds fine-tuning: explicit, separate consent
  • Audit log: consent timestamp + version

Production checklist

  • Data residency in Turkey
  • Audit logs PII-redacted
  • Right-to-be-forgotten pipeline
  • Consent management system
  • Encryption at rest + in transit
  • Access control (RBAC)
  • Incident response plan
  • DPO (Veri Sorumlusu) assignment
  • Periodic privacy audits
Module 57 (Compliance) has the full playbook.

12. Closing Part I + Transition to Part II

🎉 Part I Complete!

The Engineer's Arsenal: Mathematics, Programming, Mental Model

| Module | Topic | Lessons |
| --- | --- | --- |
| 0 | Course framework + workshop setup | 5 |
| 1 | Mathematical arsenal (LinAlg, calculus, prob, info, optim) | 10 |
| 2 | NumPy + autograd from scratch | 6 |
| 3 | The philosophical history of deep learning | 5 |
| 4 | The mental model of LLMs | 8 |
| 5 | PyTorch engineering | 8 |
| Total | | 42 lessons |

What did you learn?

As an engineer, you now have:
  • Mathematics: linear algebra, calculus, probability, information theory, optimization
  • Programming: PyTorch internals, custom autograd, Triton kernels
  • Systems: distributed training, memory profiling, debugging
  • Concepts: the probabilistic model of an LLM, scaling laws, emergent capabilities
  • History: 70 years of AI evolution and its paradigm shifts
  • Production: CI/CD, versioning, deployment, KVKK

Next: Part II, The Skeleton of the Transformer Architecture

8 modules, 76 lessons. The heart of the transformer:
  • Module 6: Tokenization Micro-Surgery
  • Module 7: Embeddings, the Geometry of Meaning
  • Module 8: The Mathematical Construction of Attention
  • Module 9: Position Encoding (RoPE, ALiBi)
  • Module 10: The Transformer Block from Scratch
  • Module 11: Modern Architecture Families (Llama, Qwen, DeepSeek)
  • Module 12: Mixture of Experts (MoE)
  • Module 13: Transformer Alternatives (Mamba, RWKV)

Your tasks after Part I

  1. Your own mini LLM should be running (Module 2 capstone)
  2. You should be comfortable with a production-ready PyTorch stack
  3. You should be able to read and understand a paper (especially FlashAttention, Llama, DeepSeek-V3)
  4. You should have a structured debugging workflow
  5. You should be able to set up a CI/CD pipeline
See you in Module 6. 🚀

13. Mini Exercises

  1. Reproducibility test: rerun an LLM training run from six months ago. How do you get the same result?
  2. CI design: design a CI pipeline for a Turkish e-commerce company's fine-tune workflow.
  3. Canary metrics: you are deploying a new model to an LLM service handling 100K req/day. Which metrics do you watch in the first 24 hours?
  4. Drift example: a Turkish customer support model's accuracy dropped 5% after six months. Likely cause and fix?
  5. KVKK incident: user PII made it into LLM training. What is your action plan for the first 24 hours?

What Did We Learn in This Lesson?

  ✓ How ML CI/CD differs from classical SDE: three dimensions (code/data/model)
  ✓ The four layers of reproducibility: code, dependencies, data, hardware/random
  ✓ Versioning: DVC (data), MLflow (experiments), HF Hub (models), Git (prompts)
  ✓ Eval harness CI integration: fast vs. full eval matrix
  ✓ Canary deployment: gradual rollout strategies
  ✓ A/B testing for ML: statistical considerations
  ✓ Drift monitoring: data, concept, and performance drift
  ✓ KVKK-compliant deployment: data residency, audit, consent, right to be forgotten
  ✓ Auto-rollback triggers

🎉 Part I Complete!

42 lessons, ~2200 minutes (~37 hours) of content. The engineer's arsenal is ready.

Next Section

Part II: The Skeleton of the Transformer Architecture (8 modules, 76 lessons). Starting from Module 6 we will work through every component of the transformer in detail: tokenization, embeddings, attention, position encoding, modern architectures. Built from scratch, at Karpathy depth.
Module 6: Tokenization Micro-Surgery, 10 lessons, ~120 min.
See you there! 🚀

Frequently Asked Questions

Someone completing Part I has technical knowledge at **Junior LLM Engineer / Mid AI Engineer** level. In interviews you can handle: PyTorch internals, distributed training concepts, mixed precision, eval design, production patterns. Missing: production experience (unless capstone projects fill the gap). With Parts II-IV you reach **Senior LLM Engineer** level. Module 63 (Career) gives detailed path.
