Red-Teaming Lab: GCG + PAIR + AutoDAN + Prompt Injection Robustness
Mandatory before production deploy: red-team probe. GCG (Greedy Coordinate Gradient — adversarial suffix attack), PAIR (Prompt Automatic Iterative Refinement — LLM attacks LLM), AutoDAN (jailbreak auto-generation), prompt injection (malicious instruction in RAG context). Cookbook's open red-team corpus + scoring method.
Şükrü Yusuf KAYA
30 min read
Advanced1. Red-Team Attack Tipleri#
| Attack | Yöntem | Zorluk |
|---|---|---|
| Manual jailbreak | Insan tarafından prompt yazılır ("DAN", "AIM") | düşük |
| Roleplay | "Sen bir hacker'sın" | orta |
| GCG (Zou 2023) | Gradient-based suffix optimization | yüksek (whitebox) |
| PAIR (Chao 2023) | LLM-vs-LLM iterative refinement | yüksek |
| AutoDAN (Liu 2024) | Genetic algorithm + LLM | yüksek |
| Prompt injection | RAG context'inde "ignore previous" instructions | orta |
| Multilingual | TR prompt + AR/RU obfuscation | orta |
Cookbook'un kuralı: Production deploy öncesi en az 4 attack tipinde test, ASR (Attack Success Rate) < %5 olmalı.
✅ Teslim
- HarmBench veya AdvBench dataset'i indir. 2) FT model üzerinde GCG attack koş. 3) ASR ölç. 4) Sonraki ders: 18.8 — Watermarking & Provenance.
Yorumlar & Soru-Cevap
(0)Yorum yazmak için giriş yap.
Yorumlar yükleniyor...
Related Content
Part 0 — Engineering Foundations
Welcome to the Fine-Tuning Cookbook: System, Stage Taxonomy, and the Reproducibility Contract
Start LearningPart 0 — Engineering Foundations
Reproducibility Stack: Seeds, cuDNN Flags, and Deterministic CUDA — End the 'Works on My Machine' Problem
Start LearningPart 0 — Engineering Foundations
Environment Pinning: uv + pyproject.toml, CUDA Version Matrix, and Container Recipes
Start LearningConnected pillar topics