If Anthropic has no logprobs, how can you detect hallucinations?

Alternative approaches: (1) **Multi-sample SelfCheck**: sample several responses at temperature > 0 and check their semantic consistency (NLI or BERTScore). (2) **Confidence via structured output**: add a `confidence` field (0-1) to the schema and ask the model to self-evaluate; this works empirically, but the scores are miscalibrated. (3) **External verifier**: a second model (Claude or GPT) checks the answer. (4) **RAG citation**: hallucination risk is lower when the response is grounded in cited sources. Combining several approaches is safest.

logprobs suggest the response is reliable, but it's wrong. Why?

Because logprobs measure **distribution confidence**, not **factual correctness**. A model can say 'Ankara is in Spain' with a high logprob because the pattern looked coherent in training. **Practical message**: logprobs are only one signal. For factuality, add RAG verification, tool use (search), an external knowledge base, or multi-sample voting. No single heuristic is 100% reliable.

In MCQ scoring, why is the answer token ' A' (with space) instead of 'A'?

This comes from BPE tokenization (covered in Module 4.2). Many modern tokenizers fold a leading space into the token, so in 'Cevap: A' the letter is tokenized as ' A'. If the prompt ends with ':', the model expects a space followed by a letter. **Practical**: check both the ' A' and 'A' forms and take the max. OpenAI's cl100k tokenizer typically uses ' A'; Llama 3 may differ, so test per model.

Isn't 3 samples enough instead of 5 in SelfCheck-NLI?

It is a trade-off. As a rough rule of thumb: 3 samples catch 80%+ of inconsistencies but may miss edge cases, 5 samples catch 90%+, and 10+ catch 95%+. Practical: **high-stakes 5-10, medium 3-5, low 0 (no SelfCheck)**. Cost: N samples cost about N× the input tokens (outputs may be short). Modern alternative: a **reasoning model** (o1, R1), where a single sample can approach multi-sample voting quality.

How to calibrate logprobs thresholds for Turkish?

Empirical calibration: (1) Prepare a Turkish dataset (200+ Q&A pairs with ground truth). (2) Run the model and record avg_logprob for each response. (3) Plot the logprob distributions for correct vs incorrect responses. (4) Set the threshold at the point of best separation (typically around -3.5 for Turkish, -2.5 for English). (5) Evaluate the precision/recall trade-off with an ROC curve. Module 53 (Evaluation) details these calibration methods. The **TurkEval-Suite** capstone (C10) builds on this pattern.

Logit Observability: Reading the Model's Mind with logprobs — Production Diagnostics

Production-grade use of logprobs API: confidence-based filtering, hallucination detection, prompt diagnostics, model probing, MCQ scoring, semantic confidence, anomaly detection. logits/probability/log-probability conversions, token-level entropy, extraction techniques.

Şükrü Yusuf KAYA

55 min read

5/13/2026

Intermediate


🔍 Look 'inside the model's mind'

An ordinary API user sees only the text output. A professional LLM engineer uses logprobs to read the model's indecision, its alternatives, and its confidence. This is the detail that makes a difference in production-grade LLM products. In 55 minutes you will know how to catch hallucinations early in production, gain 15%+ accuracy on MCQs, and run prompt diagnostics.

Lesson Map#

The logit, log-prob, probability trio
logprobs API: OpenAI vs Anthropic vs open-source
Token-level confidence scoring
Sequence-level likelihood
MCQ scoring: logprobs for classification
Hallucination detection with logprobs
Token-level entropy and "uncertainty hotspots"
Anomaly detection: catching odd patterns
Prompt diagnostics: which tokens carry no information
Semantic confidence: from multiple generations
Production monitoring: real-time dashboards
Limits: what logprobs don't tell you

1. The Logit, Log-Prob, Probability Trio#

The three values carry essentially the same information in different representations:

Representation	Range	Typical use
Logit (z)	(-∞, ∞)	Model output, internal
Probability (p)	(0, 1]	Decision making, thresholds
Log-probability (log p)	(-∞, 0]	Numerical stability, multiplication

Conversions#

logit → probability: softmax(z)
probability → log_prob: log(p)
log_prob → probability: exp(log_p)
log_prob → logit: log_p + log(Z) where Z = sum(exp(z))   # but Z is usually unknown
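The four conversions above can be sketched in plain Python. This is a minimal illustration, assuming a made-up 3-token vocabulary; the `log_softmax` helper and the logit values are not from the lesson. It uses the log-sum-exp trick so that large logits do not overflow.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities using the log-sum-exp trick."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

logits = [2.0, 1.0, 0.1]                      # made-up values, 3-token vocabulary
logprobs = log_softmax(logits)                # logit -> log-prob
probs = [math.exp(lp) for lp in logprobs]     # log-prob -> probability

# Going back to logits needs log(Z), which an API response does not include
log_z = math.log(sum(math.exp(z) for z in logits))
recovered = [lp + log_z for lp in logprobs]

print([round(p, 3) for p in probs])
print([round(z, 3) for z in recovered])
```

The last step only works here because the original logits are available to compute Z; with API logprobs alone the logits are recoverable only up to an additive constant.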

Why log-prob?#

Numerical stability: log(0.001) = -6.9 is an ordinary number, whereas multiplying many values like 0.001 risks underflow
Sum instead of product: P(x_1, x_2) = P(x_1) × P(x_2 | x_1) → in log space: log(P_1) + log(P_2)
Loss = NLL = -log P: the natural language of model training
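The underflow point is easy to reproduce. The sketch below assumes an illustrative sequence of 400 tokens, each with probability 0.001: the raw product collapses to zero in double precision, while the sum of logs stays an ordinary number.

```python
import math

token_probs = [0.001] * 400   # 400 tokens, each with probability 0.001 (illustrative)

product = 1.0
for p in token_probs:
    product *= p               # eventually falls below the smallest representable float

log_sum = sum(math.log(p) for p in token_probs)

print(product)   # the product has underflowed
print(log_sum)   # the same quantity in log space
```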

In practice#

The logprob value that APIs return is log(softmax(logits)). So:

logprob = -0.05 → P ≈ 0.95 (very confident)
logprob = -2.0 → P ≈ 0.135 (quite uncertain)
logprob = -10 → P ≈ 0.00005 (the model would almost never pick this)

2. logprobs APIs: Comparison#

OpenAI#

resp = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[...],
    logprobs=True,
    top_logprobs=5,         # top-5 alternatives for each token
)
# resp.choices[0].logprobs.content

Details:

top_logprobs: max 20
For each token: token + logprob + bytes + top_logprobs
With streaming: delivered chunk by chunk

Anthropic#

As of early 2026:

No native logprobs
Workaround via structured output: add a score field to the schema and ask the model for a self-evaluation
The Cohere and Mistral APIs also offer logprobs-like features

Open-source (Llama, Qwen)#

vLLM and transformers give you full access to the logits, so you can compute anything you want.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("...")
tok = AutoTokenizer.from_pretrained("...")

inputs = tok("prompt", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
    logits = out.logits        # (B, T, V)
    logprobs = torch.log_softmax(logits, dim=-1)

The full V-dimensional logit/logprob vector is the most powerful basis for analysis.

3. Token-Level Confidence Scoring#

Use the logprob to measure how "sure" the model is about a token.

Simple metric: log-prob#

confidence(token) = exp(logprob)   # (0, 1]

More informative: top-k logprobs gap#

If the top-1 and top-2 logprobs are close, the model is undecided; if they are far apart, it is certain.

margin = top_1_logprob - top_2_logprob

margin > 5 (probability ratio 150x+): very confident
margin 1-3 (3x-20x ratio): moderate
margin < 0.5 (ratio under about 1.65x): undecided
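The margin bands translate directly into a small helper. This is a sketch, assuming natural-log logprobs; the function name `margin_label` and the "in between" label for the ranges the list leaves unnamed (0.5-1 and 3-5) are illustrative choices, not part of the lesson.

```python
import math

def margin_label(top1_logprob, top2_logprob):
    """Return (margin, probability ratio, label) for the two best candidates."""
    margin = top1_logprob - top2_logprob
    ratio = math.exp(margin)            # P(top-1) / P(top-2)
    if margin > 5:
        label = "very confident"
    elif 1 <= margin <= 3:
        label = "moderate"
    elif margin < 0.5:
        label = "undecided"
    else:
        label = "in between"            # ranges not named in the list above
    return margin, ratio, label

print(margin_label(-0.01, -6.0))
print(margin_label(-0.7, -0.9))
```

Because the margin is a difference of logs, `exp(margin)` is exactly the probability ratio between the two candidates, which is why the bands can be read as "150x", "3x-20x", and so on.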

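Entropy metric#

The entropy of a token's probability distribution (top-k approximation), in bits:

import math

def token_entropy(top_logprobs):
    # top_logprobs: list of log-probabilities (natural log) for the top-k candidates
    probs = [math.exp(lp) for lp in top_logprobs]
    # Renormalize over the top-k only (approximation: the tail mass is ignored)
    total = sum(probs)
    probs = [p / total for p in probs]
    # log2 so the result is in bits, matching the thresholds below
    return -sum(p * math.log2(p + 1e-10) for p in probs)

entropy ~0: very confident
entropy ~1 bit: torn between 2 options
entropy >2 bits: very uncertain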


# OpenAI logprobs ile token confidence
from openai import OpenAI
import math
 
client = OpenAI()
 
resp = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": "What is the capital of Turkey?"}],
    logprobs=True,
    top_logprobs=10,
    max_tokens=15,
)
 
content = resp.choices[0].logprobs.content
 
print(f"{'Token':<20} {'logprob':>10} {'P':>8} {'Top-2 margin':>15} {'Entropy':>10}")
print("-" * 70)
for tok in content:
    chosen_lp = tok.logprob
    chosen_p = math.exp(chosen_lp)
    # Top-2 margin
    if len(tok.top_logprobs) >= 2:
        margin = tok.top_logprobs[0].logprob - tok.top_logprobs[1].logprob
    else:
        margin = float('inf')
    # Entropy (top-k approx)
    probs = [math.exp(t.logprob) for t in tok.top_logprobs]
    total = sum(probs)
    probs_norm = [p/total for p in probs]
    entropy = -sum(p * math.log(p + 1e-10) for p in probs_norm) / math.log(2)
 
    print(f"{repr(tok.token):<20} {chosen_lp:>10.4f} {chosen_p:>8.4f} {margin:>15.2f} {entropy:>10.4f}")

Token-level confidence + entropy calculation.

4. Sequence-Level Likelihood#

Instead of individual tokens, the likelihood of the whole sequence:

log P(sequence) = Σ_t log P(x_t | x_<t)

Practical metrics:

sequence_logprob = sum(t.logprob for t in content)
avg_logprob = sequence_logprob / len(content)
perplexity = math.exp(-avg_logprob)

Interpretation#

avg_logprob > -1: very confident response (PPL < 2.7)
avg_logprob -1 to -2: normal (PPL 2.7-7.4)
avg_logprob < -3: the model is very uncertain, be suspicious (PPL > 20)
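The three metrics can be checked offline on a plain list of logprobs. A minimal sketch follows; the helper name `sequence_stats` and the sample values are assumptions for illustration.

```python
import math

def sequence_stats(token_logprobs):
    """Summarize a response from its per-token log-probabilities."""
    total = sum(token_logprobs)
    avg = total / len(token_logprobs)
    return {
        "sequence_logprob": total,
        "avg_logprob": avg,
        "perplexity": math.exp(-avg),   # PPL = exp(-average log-prob)
    }

stats = sequence_stats([-0.5, -0.8, -0.3, -1.0, -0.6])   # illustrative values
print(stats)
```

With an API response, the input list would be `[t.logprob for t in content]`.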

Length-normalized#

A long sequence is at a disadvantage against a short one (the product of probabilities keeps shrinking). Length normalization:

normalized_logprob = sequence_logprob / len(content) ** alpha   # alpha=0.7-1.0

A common trick in translation and beam search.
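To see why the normalization matters, compare two candidates with made-up logprobs: a short one with weaker per-token confidence and a longer one with stronger per-token confidence. The helper name and the values are illustrative.

```python
def length_normalized(token_logprobs, alpha=0.7):
    """Sequence log-probability divided by length**alpha."""
    return sum(token_logprobs) / (len(token_logprobs) ** alpha)

short_seq = [-1.0, -1.0]      # 2 tokens, weaker per-token confidence
long_seq = [-0.6] * 5         # 5 tokens, stronger per-token confidence

# Raw sum: the short sequence wins simply because it has fewer terms
raw_winner = "short" if sum(short_seq) > sum(long_seq) else "long"
# Normalized: the per-token quality of the long sequence shows through
norm_winner = "short" if length_normalized(short_seq) > length_normalized(long_seq) else "long"

print(raw_winner, norm_winner)
```

With `alpha=1.0` this reduces to the plain average logprob; values below 1 keep a mild preference for shorter outputs.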

5. MCQ Scoring: logprobs for Classification#

The classic problem with a multiple-choice question (MCQ): which of "A", "B", "C", "D" is the answer?

Naive approach: generation + parse#

Question: "Hangisi bir dağdır? A) Everest B) Akdeniz C) Sahra"   (Which one is a mountain? A) Everest B) Mediterranean C) Sahara)
LLM: "A) Everest"
Parse: "A"

Problem: the model sometimes writes "The answer is A" and sometimes just the letter, so parsing is fragile.

Logprobs approach#

Expect a single token as the answer (" A", " B", " C", " D"). Take the logprob of each and pick the highest.

from openai import OpenAI

client = OpenAI()

# Get the logprobs of the 4 candidate answer tokens
candidates = [" A", " B", " C", " D"]
# Turkish prompt: "Which one is a mountain? A) Everest B) Mediterranean C) Sahara / Answer:"
prompt = "Hangisi bir dağdır?\nA) Everest\nB) Akdeniz\nC) Sahra\n\nCevap:"

resp = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": prompt}],
    logprobs=True,
    top_logprobs=20,
    max_tokens=1,
)

# Look for the candidates among the top logprobs
first_token = resp.choices[0].logprobs.content[0]
scores = {}
for top in first_token.top_logprobs:
    if top.token in candidates:
        scores[top.token] = top.logprob

# Pick the highest logprob (scores is empty if no candidate made the top 20)
if scores:
    best = max(scores, key=scores.get)
    print(f"Cevap: {best.strip()}")
else:
    print("No candidate token found in top_logprobs")

Advantages#

Robust parsing: there is no parsing at all
Confidence score: available as a probability
Calibration: probabilities for downstream decisions
Cheap: max_tokens=1, very few output tokens

Benchmark improvements#

On MCQ benchmarks such as MMLU, HellaSwag, and ARC this approach gives +5-15% accuracy. Module 53 (Evaluation) covers the details.


# Production-grade MCQ scorer
from openai import OpenAI
import math
 
client = OpenAI()
 
def score_mcq(question, choices, model="gpt-5-mini"):
    """
    Score an MCQ with logprobs.
    choices: dict, key=letter, value=text
    """
    prompt = f"{question}\n"
    for k, v in choices.items():
        prompt += f"{k}) {v}\n"
    prompt += "\nCevap:"
 
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
        top_logprobs=20,
        max_tokens=1,
        temperature=0.0,
    )
 
    first_token = resp.choices[0].logprobs.content[0]
    # Accept both the " A" and "A" forms: strip whitespace from each top token
    # and keep the higher logprob per choice letter
 
    scores = {}
    for top in first_token.top_logprobs:
        token_stripped = top.token.strip()
        if token_stripped in choices.keys():
            if token_stripped not in scores or top.logprob > scores[token_stripped]:
                scores[token_stripped] = top.logprob
 
    if not scores:
        return None, 0  # fallback
 
    # Normalize over the choices that were found (softmax) → probabilities
    log_z = math.log(sum(math.exp(lp) for lp in scores.values()))
    probs = {k: math.exp(lp - log_z) for k, lp in scores.items()}
 
    best = max(probs, key=probs.get)
    confidence = probs[best]
    return best, confidence
 
# Test
result, conf = score_mcq(
    "Hangisi bir dağdır?",
    {"A": "Everest", "B": "Akdeniz", "C": "Sahra Çölü", "D": "Nil Nehri"},
)
print(f"Cevap: {result}, confidence: {conf:.4f}")

Production-grade MCQ scorer with logprobs.
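The scorer above keeps the higher logprob when both " A" and "A" appear. One possible alternative, sketched here as an assumption rather than the lesson's method, is to add the probability mass of all spellings of a letter before normalizing. The function name and the sample `(token, logprob)` pairs are made up for illustration.

```python
import math

def merge_choice_variants(top_logprobs, letters):
    """Sum the probability of every token that strips to the same choice letter.

    top_logprobs: list of (token, logprob) pairs, e.g. from the first generated token.
    letters: the set of valid choice letters.
    """
    mass = {}
    for token, logprob in top_logprobs:
        letter = token.strip()
        if letter in letters:
            mass[letter] = mass.get(letter, 0.0) + math.exp(logprob)
    total = sum(mass.values())
    # Renormalize over the choices that were actually observed
    return {k: v / total for k, v in mass.items()} if total > 0 else {}

# Made-up top_logprobs for illustration
example = [(" A", -0.9), ("A", -1.2), (" B", -1.6), ("The", -2.5)]
probs = merge_choice_variants(example, {"A", "B", "C", "D"})
print(probs)
```

Summing treats the two spellings as the same answer, so the confidence reflects the total mass on that letter; taking the max is simpler and gives the same winner in most cases.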

6. Hallucination Detection with logprobs#

One of the early warning signs of hallucination: the answer's token-level confidence is low.

Heuristic 1: average logprob threshold#

avg_lp = sequence_logprob / len(tokens)
if avg_lp < -2.5:                    # equivalent to PPL > 12
    flag_as_potential_hallucination()

Heuristic 2: low-confidence token count#

low_conf_count = sum(1 for t in tokens if t.logprob < -3)
if low_conf_count > len(tokens) * 0.2:    # 20%+ low-confidence
    flag()
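Heuristics 1 and 2 can be tried offline on a plain list of logprobs. The sketch below wraps both in one function; `basic_flags`, its parameter names, and the sample values are assumptions for illustration, while the default thresholds are the ones used above.

```python
def basic_flags(token_logprobs, avg_threshold=-2.5, token_threshold=-3.0, max_ratio=0.2):
    """Apply Heuristic 1 (average logprob) and Heuristic 2 (low-confidence ratio)."""
    avg_lp = sum(token_logprobs) / len(token_logprobs)
    low_conf = sum(1 for lp in token_logprobs if lp < token_threshold)
    ratio = low_conf / len(token_logprobs)
    return {
        "avg_logprob": avg_lp,
        "low_conf_ratio": ratio,
        "flag_avg": avg_lp < avg_threshold,
        "flag_ratio": ratio > max_ratio,
    }

confident = basic_flags([-0.2, -0.5, -0.1, -0.9, -0.3])   # illustrative values
shaky = basic_flags([-0.4, -3.6, -4.1, -0.8, -5.2])
print(confident["flag_avg"], confident["flag_ratio"])
print(shaky["flag_avg"], shaky["flag_ratio"])
```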

Heuristic 3: named entity check#

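If the answer contains proper nouns or numbers (Atatürk, 1881, Selanik) and the logprobs of those tokens are very low, suspect hallucination.

from statistics import mean

# Extract entities with a Hugging Face NER pipeline
# (ner_pipeline, find_tokens_in_range, and flag are helpers defined elsewhere)
entities = ner_pipeline(text)
for ent in entities:
    # Find the response tokens that overlap the entity's character range
    tokens_in_ent = find_tokens_in_range(ent["start"], ent["end"], content)
    if not tokens_in_ent:
        continue  # no overlap: skip instead of averaging an empty list
    avg_lp = mean(t.logprob for t in tokens_in_ent)
    if avg_lp < -3:
        flag(f"Suspicious entity: {ent['word']}")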
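The lesson leaves `find_tokens_in_range` undefined. One way it could be written is sketched below, under two assumptions: concatenating the token strings reproduces the response text exactly, and the entity offsets are character offsets into that same text. `Tok` is a hypothetical stand-in for the API's token objects, and the sample logprobs are made up.

```python
from collections import namedtuple

# Stand-in for the API's token objects: only .token and .logprob are used
Tok = namedtuple("Tok", ["token", "logprob"])

def find_tokens_in_range(start, end, content):
    """Return the tokens whose character span overlaps [start, end)."""
    hits, pos = [], 0
    for tok in content:
        tok_start, tok_end = pos, pos + len(tok.token)
        if tok_start < end and tok_end > start:   # the two spans overlap
            hits.append(tok)
        pos = tok_end
    return hits

content = [Tok("Atatürk", -0.2), Tok(" was", -0.1), Tok(" born", -0.1),
           Tok(" in", -0.1), Tok(" 18", -3.9), Tok("81", -4.3)]
text = "".join(t.token for t in content)
start = text.index("1881")
hits = find_tokens_in_range(start, start + 4, content)
print([t.token for t in hits])
```

Tokens that split a multi-byte character may not decode to clean strings, in which case offsets computed this way can drift; checking that the joined tokens equal the response text is a cheap safeguard.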

Heuristic 4: SelfCheckGPT (Manakul 2023)#

Sample multiple times, check consistency:

1. Draw N=10 samples for the same prompt (T=0.7)
2. Pairwise NLI or BERTScore similarity
3. If 80% of the answers are inconsistent → suspect hallucination
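The three steps can be sketched without any model calls by plugging in a similarity function. In the sketch below, `jaccard` is a crude word-overlap stand-in for NLI or BERTScore, chosen only so the example runs offline; the function names, the `agree_at` cut-off, and the sample answers are all illustrative assumptions.

```python
from itertools import combinations

def jaccard(a, b):
    """Word-overlap similarity: a crude stand-in for NLI or BERTScore."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def consistency(samples, similarity=jaccard, agree_at=0.5):
    """Fraction of sample pairs whose similarity reaches agree_at (needs 2+ samples)."""
    pairs = list(combinations(samples, 2))
    agreeing = sum(1 for a, b in pairs if similarity(a, b) >= agree_at)
    return agreeing / len(pairs)

stable = ["Ankara is the capital", "The capital is Ankara", "Ankara is the capital city"]
unstable = ["It was founded in 1923", "The year was 1887", "Nobody knows the date"]
print(consistency(stable), consistency(unstable))
```

In a real pipeline the `similarity` argument would wrap an NLI model or BERTScore, since word overlap cannot tell "founded in 1923" from "founded in 1932".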

Production combo#

Use multiple heuristics as an ensemble:

score = 0
if avg_logprob < -2: score += 1
if low_conf_ratio > 0.2: score += 1
if entity_low_conf: score += 2
if self_check_consistency < 0.7: score += 2

if score >= 3:
    flag_high_hallucination_risk()
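Wrapped as a function, the ensemble is easy to unit-test. The weights and thresholds below are the ones from the snippet above; the function name, the return shape, and the sample inputs are assumptions.

```python
def hallucination_score(avg_logprob, low_conf_ratio, entity_low_conf, self_check_consistency):
    """Weighted vote over the four heuristics; returns (score, high_risk)."""
    score = 0
    if avg_logprob < -2:
        score += 1
    if low_conf_ratio > 0.2:
        score += 1
    if entity_low_conf:
        score += 2
    if self_check_consistency < 0.7:
        score += 2
    return score, score >= 3

print(hallucination_score(-0.8, 0.05, False, 0.9))
print(hallucination_score(-2.8, 0.30, True, 0.9))
```

With these weights the score ranges from 0 to 6, and either of the two strong signals (entity or self-check) needs at least one supporting signal before the response is flagged.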

For Turkish#

Turkish text typically gets lower logprobs (higher PPL). Calibrate the thresholds on your own dataset; typical starting points:

English: avg_lp threshold = -2.5
Turkish: avg_lp threshold = -3.5
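Calibrating on a labeled dataset can be sketched as a threshold sweep. The example below picks the cut-off that maximizes Youden's J (true-positive rate minus false-positive rate), which is one common way to read a single operating point off an ROC curve; the criterion, the function name, and the eight data points are assumptions for illustration, not the lesson's procedure.

```python
def best_threshold(avg_logprobs, is_correct):
    """Pick the cut-off that best separates incorrect from correct answers.

    Answers with avg_logprob <= threshold are flagged. The score is
    Youden's J = TPR - FPR, where a "positive" is an incorrect answer.
    Both correct and incorrect examples must be present.
    """
    n_bad = sum(1 for ok in is_correct if not ok)
    n_good = len(is_correct) - n_bad
    best_t, best_j = None, -1.0
    for t in sorted(set(avg_logprobs)):
        flagged_bad = sum(1 for lp, ok in zip(avg_logprobs, is_correct) if lp <= t and not ok)
        flagged_good = sum(1 for lp, ok in zip(avg_logprobs, is_correct) if lp <= t and ok)
        j = flagged_bad / n_bad - flagged_good / n_good
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Made-up calibration set: average logprob per answer and whether it was correct
lps = [-4.2, -3.9, -3.6, -3.1, -2.4, -2.0, -1.5, -1.1]
correct = [False, False, False, True, False, True, True, True]
print(best_threshold(lps, correct))
```

A real calibration set would be far larger (the lesson suggests 200+ labeled Q&A pairs), and the chosen threshold should be validated on held-out data.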

7. Token-Level Entropy and Uncertainty Hotspots#

At which token of the answer is the model most undecided?

import math

def find_uncertainty_hotspots(content, threshold=1.5):
    """
    Find the high-entropy positions in a response (entropy in bits, top-k approximation).
    """
    hotspots = []
    for i, tok in enumerate(content):
        # Top-k entropy
        probs = [math.exp(t.logprob) for t in tok.top_logprobs]
        total = sum(probs)
        probs = [p/total for p in probs]
        entropy = -sum(p * math.log2(p + 1e-10) for p in probs)
        if entropy > threshold:
            hotspots.append({
                "position": i,
                "token": tok.token,
                "entropy": entropy,
                "alternatives": [(t.token, math.exp(t.logprob)) for t in tok.top_logprobs[:3]]
            })
    return hotspots

Practical use#

Citation enforcement: if a hotspot token carries a factual claim, verify it with RAG
Human-in-the-loop: flag high-entropy answers for review
Active learning: add these examples to the fine-tuning dataset (uncertainty-based sampling)

8. Anomaly Detection: Catching Odd Patterns#

Indicators that a response is anomalous:

1. Sudden confidence drop#

Normal: -0.5, -0.8, -0.3, -1.0, -0.6...
Anomaly: -0.5, -0.8, -0.3, -8.5, -0.6...    # a single token at -8.5

A sign of a glitch token, an OOV string, or the model getting "confused".
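A sudden drop can be detected by comparing each token with its predecessor. The sketch below uses the two sequences from the example and a drop threshold of 5, the same value this lesson's dashboard metrics use for adjacent tokens; the function name is an assumption.

```python
def sudden_drops(token_logprobs, max_drop=5.0):
    """Indices where the logprob falls by more than max_drop from the previous token."""
    return [
        i for i in range(1, len(token_logprobs))
        if token_logprobs[i - 1] - token_logprobs[i] > max_drop
    ]

normal = [-0.5, -0.8, -0.3, -1.0, -0.6]
anomaly = [-0.5, -0.8, -0.3, -8.5, -0.6]
print(sudden_drops(normal), sudden_drops(anomaly))
```

Only falls are counted, so the recovery from -8.5 back to -0.6 is not reported as a second event.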

2. Sustained low confidence#

Normal: -0.5, -0.8, -0.3, -1.0...
Anomaly: -3.5, -4.2, -3.8, -4.5, -3.9...    # the whole answer is low

The model is trying to produce OOD content: a hallucination or a rare topic.

3. Repetition pattern#

The same token sequence keeps repeating, and the logprob increases with each repetition (the model grows more and more confident by repeating itself). A loop indicator.

4. Glitch tokens#

An unknown token or one with strange characters shows up in top_logprobs, SolidGoldMagikarp style.

Production dashboard#

from dataclasses import dataclass

@dataclass
class ResponseMetrics:
    avg_logprob: float
    min_logprob: float
    max_token_entropy: float
    sudden_drops: int          # logprob diff > 5 between adjacent
    low_conf_tokens_pct: float
    hallucination_score: int   # ensemble score from section 6 (0-6)

These metrics feed a real-time monitoring dashboard. Raise an alert when an anomaly threshold is exceeded.

9. Prompt Diagnostics: Which Tokens Carry No Information?#

If a prompt is very long, which tokens can you remove without affecting the output? This is information attribution.

Method: leave-one-out logprob#

def prompt_attribution(prompt, expected_response, model):
    """
    'Delete' each token in turn and measure the effect on the response's logprob.
    Large effects mean 'important', small effects mean 'redundant'.
    """
    full_lp = get_response_logprob(prompt, expected_response)

    tokens = tokenize(prompt)
    attributions = []
    for i in range(len(tokens)):
        masked = tokens[:i] + tokens[i+1:]
        masked_lp = get_response_logprob(detokenize(masked), expected_response)
        delta = full_lp - masked_lp
        attributions.append({"token": tokens[i], "importance": delta})

    return sorted(attributions, key=lambda x: -x["importance"])
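The function above relies on `get_response_logprob`, `tokenize`, and `detokenize`, which the lesson does not define. To make the mechanics visible without a model, the sketch below works on whitespace-separated words and takes the scorer as an argument. `toy_score` is a hypothetical scorer that pretends only two keywords matter; a real one would return the log-probability of the expected response given the prompt.

```python
def leave_one_out(words, score_fn):
    """Importance of each word = score(full prompt) - score(prompt without it)."""
    full = score_fn(" ".join(words))
    results = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        results.append((word, full - score_fn(reduced)))
    # Most important words first
    return sorted(results, key=lambda pair: -pair[1])

def toy_score(prompt):
    """Hypothetical scorer: pretends only two keywords affect the response logprob."""
    score = -5.0
    if "capital" in prompt:
        score += 2.0
    if "Turkey" in prompt:
        score += 2.5
    return score

ranking = leave_one_out("Please kindly tell me the capital of Turkey".split(), toy_score)
print(ranking[:2])
```

Each word costs one scoring call, so a prompt of n tokens needs n + 1 model evaluations; that cost is the main practical limit of leave-one-out attribution.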

In practice#

On long prompts this exposes redundant tokens → prompt compression. LLMLingua-2 and similar tools automate this kind of token-importance scoring and shrink the prompt.

Module link#

Module 47 (Cost Engineering) covers prompt compression in detail. Logprobs-based attribution is one of the basic techniques.

10. Semantic Confidence: Multi-Sample Consistency#

The logprob of a single generation gives token-level confidence. Semantic confidence instead means drawing several samples for the same question and measuring how consistent they are.

SelfCheck-NLI (Manakul 2023)#

1. Draw N=5 samples for the same prompt (T=0.7)
2. Extract the **factual claims** in each sample
3. Use the other N-1 samples as references
4. For each claim: count how many references support it and how many contradict it (via NLI)
5. Score = support_ratio

SelfCheck-BERTScore#

A simpler variant: pairwise BERTScore similarity across the N samples.

Production#

High-stakes Q&A: minimum 5 samples
Threshold: ≥ 80% consistency → reliable
< 50% → hallucination probable

Cost#

5x cost. In which scenarios is it needed?

Medical: mandatory
Legal: mandatory
Customer support: debatable
Casual chat: unnecessary

Module 53 (Evaluation) and Module 56 (Safety) cover these topics in detail.

11. Production Monitoring: Real-Time Dashboards#

Observability is critical in production LLM applications. Tools such as Langfuse, Helicone, Phoenix, and Arize integrate logprobs into traces.

Recommended metrics#

Metric	Threshold	Action
Avg logprob	< -3	Flag, log for review
Max token entropy	> 3 bits	Investigate
Low conf token ratio	> 25%	Re-query with RAG
Sudden drop count	> 0	Check for glitch
Response perplexity	> 15	Hallucination suspect

Dashboard panels#

Time-series: avg logprob over time (per hour/day)
Histogram: response logprob distribution
Heatmap: token position × confidence (at which positions does the uncertainty sit?)
Alert feed: real-time anomaly detection

Langfuse pattern#

from langfuse import Langfuse
langfuse = Langfuse(public_key=..., secret_key=...)

trace = langfuse.trace(name="llm-response")
generation = trace.generation(
    name="primary-response",
    model="gpt-5-mini",
    input=prompt,
    output=response_text,
    usage={"input_tokens": ..., "output_tokens": ...},
    metadata={
        "avg_logprob": avg_lp,
        "low_conf_count": low_conf,
        "entropy_max": max_entropy,
        "hallucination_score": h_score,
    },
)

Module 48 (Observability) covers this in detail.

12. What logprobs Don't Tell You#

logprobs are very valuable, but they don't tell you everything:

1. Internal "reasoning"#

Reasoning models such as o1 and R1 think in an internal CoT that is (usually) not exposed through the API. Only the logprobs of the final answer are visible.

2. Cross-token dependency#

A single token's logprob is that token's conditional probability, not the conceptual "correctness" of the sentence as a whole.

"Ankara is in Spain." → every token can get a high logprob
but the meaning is wrong

3. Calibration#

P = 0.95 ≠ 95% accuracy. It depends on the model's calibration (Module 4.1).
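The gap between stated probability and actual accuracy can be measured with expected calibration error (ECE). The sketch below is a minimal version with equal-width bins; the bin count and the ten made-up answers are assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Average |accuracy - mean confidence| over equal-width bins, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / len(confidences) * abs(accuracy - mean_conf)
    return ece

# Ten answers, all given P = 0.95, but only six are right (made-up data)
ece = expected_calibration_error([0.95] * 10, [True] * 6 + [False] * 4)
print(round(ece, 2))
```

A well-calibrated model would score close to 0; here the model claims 95% and delivers 60%, and the ECE reports that gap.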

4. Confidence vs correctness#

The model can be confident and still wrong. Confidence only means "I saw this pattern consistently in the training distribution".

5. Causal vs spurious patterns#

The model may assign a high logprob to a token because it correlates with a keyword in the prompt, not for a real logical reason.

Using logprobs deliberately#

logprobs are an indicator, not the truth. Combine them with other signals (RAG verification, tool use, multi-sample).

13. Mini Exercises#

logprob → probability: logprob = -2.3 corresponds to what probability? A sequence has total logprob -23 and length 10. What is its perplexity?
MCQ scoring practice: for a 4-choice question the answer logprobs are A=-0.5, B=-1.2, C=-2.0, D=-3.1. Probabilities? Confidence?
Hallucination heuristic: avg logprob -2.8, entity logprobs [-0.5, -1.0, -4.5]. How strong is the hallucination suspicion?
Token entropy calculation: top-3 logprobs [-0.5, -2.0, -3.5]. Token entropy?
SelfCheck threshold: 6 of 10 samples are consistent, 4 differ. Would you trust this answer? What threshold would you suggest?

What Did We Learn in This Lesson?#

✓ The logit, log-prob, probability trio and their conversions
✓ logprobs API: OpenAI vs Anthropic vs open-source
✓ Token-level confidence: logprob, margin, entropy
✓ Sequence-level likelihood and length-normalized variants
✓ MCQ scoring: 5-15% accuracy improvement
✓ Hallucination detection: 4 heuristics + ensemble
✓ Uncertainty hotspots: high-entropy positions
✓ Anomaly detection: sudden drops, sustained low confidence, glitch tokens
✓ Prompt diagnostics: leave-one-out attribution
✓ Semantic confidence: SelfCheck-NLI multi-sample
✓ Production monitoring: Langfuse + dashboard pattern
✓ The limits of logprobs: calibration, causal vs spurious

Next Lesson#

4.5: The Mathematics of In-Context Learning: Implicit Bayesian Inference. A mathematical explanation of the magic behind GPT-3's "few-shot learning" ability: the implicit Bayesian inference hypothesis, induction heads, and hints from mechanistic interpretability. A scientific framework for the most mysterious emergent capability of modern LLMs.

Frequently Asked Questions

Can streaming be used with OpenAI logprobs?

Yes, supported since 2024. Each chunk includes token + logprob + top_logprobs. **Performance**: small overhead (~5% latency increase); top_logprobs=20 increases network bandwidth. In production, enable it only where needed and leave it off by default. Cost is unchanged (the output tokens are the same).
