Hands-on: Mini Eval Pipeline (Python + Promptfoo)

Author: Şükrü Yusuf KAYA

An eval pipeline from scratch: golden set, judge, A/B comparison, regression alarm. A practical example with Promptfoo.


14-minute read

11.05.2026

Advanced

Mini Eval Pipeline Lab

1. Promptfoo Setup

bash

npm install -g promptfoo
mkdir my-eval && cd my-eval
promptfoo init

Promptfoo setup

2. promptfoo.yaml

yaml

# Compare two prompt versions
description: "Sentiment Classifier v1 vs v2"

providers:
  - id: anthropic:claude-haiku-4-5-20251001
    config:
      temperature: 0
      max_tokens: 50

prompts:
  - id: v1
    label: "Simple prompt"
    raw: |
      Review: "{{review}}"
      Classify: positive/negative/neutral

  - id: v2
    label: "Few-shot prompt"
    raw: |
      Classify the review below.
      Examples:
      - "Arrived fast, great" → positive
      - "The refund never arrived" → negative
      - "It's okay" → neutral

      Review: "{{review}}"
      Class:

tests:
  - vars: { review: "Excellent quality, I really liked it" }
    assert:
      - type: equals
        value: "positive"

  - vars: { review: "Arrived late but the product is nice" }
    assert:
      - type: equals
        value: "neutral"

  - vars: { review: "A complete money trap" }
    assert:
      - type: equals
        value: "negative"

  # Add 50 more tests...

promptfoo.yaml
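
The config above lists each test by hand. As a rough sketch, the remaining cases could be generated from a golden set instead of being typed out. `GOLDEN_ROWS` and `build_tests` are hypothetical names introduced here, not part of Promptfoo; the output only mirrors the `vars`/`assert` structure used in the config. JSON is printed on the assumption that the YAML parser accepts JSON-style collections, so the result could be pasted under `tests:`.

```python
import json

# Hypothetical golden rows: (review text, expected label)
GOLDEN_ROWS = [
    ("Excellent quality, I really liked it", "positive"),
    ("Arrived late but the product is nice", "neutral"),
    ("A complete money trap", "negative"),
]

def build_tests(rows):
    """Turn (review, label) pairs into entries shaped like the `tests:` list."""
    return [
        {"vars": {"review": review}, "assert": [{"type": "equals", "value": label}]}
        for review, label in rows
    ]

tests = build_tests(GOLDEN_ROWS)
print(json.dumps(tests, indent=2, ensure_ascii=False))
```

Keeping the golden set in one place like this means the Promptfoo config and a custom pipeline can share the same cases.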

3. Run + Report

bash

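# promptfoo looks for promptfooconfig.yaml by default, so pass this file explicitly
promptfoo eval -c promptfoo.yaml

# Open the report UI
promptfoo view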

Eval execution

4. Custom Python Pipeline (More Flexible)

python

# Your own eval pipeline
import os
from dataclasses import dataclass

from anthropic import Anthropic

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

@dataclass
class TestCase:
    id: str
    input: str
    expected: str
    category: str

GOLDEN = [
    TestCase("g1", "Excellent product", "positive", "easy"),
    TestCase("g2", "Arrived late but it's nice", "neutral", "hard"),
    TestCase("g3", "A total disaster", "negative", "easy"),
    # ... 50+
]

def run_prompt(prompt_template: str, test: TestCase) -> str:
    """Fill the template with the review and return the lowercased model reply."""
    prompt = prompt_template.replace("{{review}}", test.input)
    r = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=20,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return r.content[0].text.strip().lower()

def evaluate(prompt_template: str, name: str) -> float:
    """Run every golden case, print overall and per-category accuracy, return the accuracy."""
    correct = 0
    by_category = {}  # category -> [correct, total]
    for t in GOLDEN:
        pred = run_prompt(prompt_template, t)
        # Lenient match: the expected label only has to appear somewhere in the reply
        ok = t.expected in pred
        stats = by_category.setdefault(t.category, [0, 0])
        stats[1] += 1
        if ok:
            correct += 1
            stats[0] += 1

    accuracy = correct / len(GOLDEN)
    print(f"\n=== {name} ===")
    print(f"Total: {correct}/{len(GOLDEN)} = {accuracy:.0%}")
    for cat, (c, n) in by_category.items():
        print(f"  {cat}: {c}/{n} = {c/n:.0%}")
    return accuracy

V1 = "Review: '{{review}}' → positive/negative/neutral?"
V2 = """Classify the review below as positive/negative/neutral.

Review: "{{review}}"
Class:"""

evaluate(V1, "v1 - simple")
evaluate(V2, "v2 - more explicit")

Custom eval pipeline: per-category breakdown
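
Two accuracy totals do not show whether one prompt fixes cases the other already handled or breaks different ones. A minimal sketch of a paired A/B comparison follows, assuming per-case correctness has already been collected for each prompt version. The `compare` helper and the sample results are hypothetical and are not produced by the pipeline above.

```python
from typing import Dict, List

def compare(results_a: Dict[str, bool], results_b: Dict[str, bool]) -> Dict[str, List[str]]:
    """Pair per-case results by test id and bucket them by which variant got the case right."""
    buckets = {"a_only": [], "b_only": [], "both": [], "neither": []}
    # Only ids present in both runs can be compared pairwise
    for case_id in sorted(results_a.keys() & results_b.keys()):
        a_ok, b_ok = results_a[case_id], results_b[case_id]
        if a_ok and b_ok:
            buckets["both"].append(case_id)
        elif a_ok:
            buckets["a_only"].append(case_id)
        elif b_ok:
            buckets["b_only"].append(case_id)
        else:
            buckets["neither"].append(case_id)
    return buckets

# Hypothetical per-case outcomes (test id -> was the prediction correct?)
v1_results = {"g1": True, "g2": False, "g3": True, "g4": False}
v2_results = {"g1": True, "g2": True, "g3": False, "g4": False}

diff = compare(v1_results, v2_results)
print(f"fixed by v2: {diff['b_only']}")
print(f"broken by v2: {diff['a_only']}")
print(f"net change: {len(diff['b_only']) - len(diff['a_only']):+d} cases")
```

In this made-up data both versions get 2 of 4 cases right, yet each fails a different case, which the totals alone would hide.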

5. Regression Alarm#

python

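# Run the eval on every PR; fail if the score drops below the baseline
def regression_check(current_score: float, baseline: float = 0.85, tolerance: float = 0.02) -> None:
    """Raise if current_score is more than `tolerance` below `baseline`."""
    if current_score < baseline - tolerance:
        raise RuntimeError(f"❌ Regression: {current_score:.0%} < {baseline:.0%} - {tolerance:.0%}")
    print(f"✅ Eval pass: {current_score:.0%}")

# In CI/CD: the uncaught exception makes the script exit non-zero, which blocks the merge
# regression_check(eval_score)  # eval_score: accuracy from the eval run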

CI regression check
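
A hard-coded baseline tends to go stale as prompts improve. One possible variation is sketched here, under the assumption that the last accepted score is kept in a small JSON file. The file name `baseline.json` and the helpers `load_baseline` and `gate` are hypothetical, and the function returns an exit code rather than raising so that a CI script could pass it to `sys.exit`.

```python
import json
import tempfile
from pathlib import Path

def load_baseline(path: Path, default: float = 0.85) -> float:
    """Read the stored baseline score, falling back to `default` if the file is missing."""
    if not path.exists():
        return default
    return float(json.loads(path.read_text())["score"])

def gate(current: float, path: Path, tolerance: float = 0.02) -> int:
    """Return 0 (pass) or 1 (regression); store the new score when it beats the baseline."""
    baseline = load_baseline(path)
    if current < baseline - tolerance:
        print(f"Regression: {current:.0%} vs baseline {baseline:.0%}")
        return 1
    if current > baseline:
        path.write_text(json.dumps({"score": current}))
    print(f"Eval pass: {current:.0%}")
    return 0

# Demo against a temporary file instead of a real repository
with tempfile.TemporaryDirectory() as tmp:
    baseline_file = Path(tmp) / "baseline.json"
    first = gate(0.90, baseline_file)   # no file yet: compared with the 0.85 default, then stored
    second = gate(0.80, baseline_file)  # 0.80 < 0.90 - 0.02, so this run is flagged
    stored = load_baseline(baseline_file)
```

Whether the baseline should ratchet up automatically is a design choice; some teams may prefer to update it only through a reviewed commit.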

Tool comparison:

Promptfoo: easy, open source, ideal for getting started
LangSmith (LangChain): production observability + eval
Langfuse: open-source alternative
Helicone: gateway + eval
Custom: flexible, but carries a maintenance cost
