Skip to content

Red-Teaming Lab: GCG + PAIR + AutoDAN + Prompt Injection Robustness

Mandatory before production deploy: red-team probe. GCG (Greedy Coordinate Gradient — adversarial suffix attack), PAIR (Prompt Automatic Iterative Refinement — LLM attacks LLM), AutoDAN (jailbreak auto-generation), prompt injection (malicious instruction in RAG context). Cookbook's open red-team corpus + scoring method.

Şükrü Yusuf KAYA
30 min read
Advanced
Red-Teaming Lab: GCG + PAIR + AutoDAN + Prompt Injection Robustness

1. Red-Team Attack Tipleri#

AttackYöntemZorluk
Manual jailbreakInsan tarafından prompt yazılır ("DAN", "AIM")düşük
Roleplay"Sen bir hacker'sın"orta
GCG (Zou 2023)Gradient-based suffix optimizationyüksek (whitebox)
PAIR (Chao 2023)LLM-vs-LLM iterative refinementyüksek
AutoDAN (Liu 2024)Genetic algorithm + LLMyüksek
Prompt injectionRAG context'inde "ignore previous" instructionsorta
MultilingualTR prompt + AR/RU obfuscationorta
Cookbook'un kuralı: Production deploy öncesi en az 4 attack tipinde test, ASR (Attack Success Rate) < %5 olmalı.
✅ Teslim
  1. HarmBench veya AdvBench dataset'i indir. 2) FT model üzerinde GCG attack koş. 3) ASR ölç. 4) Sonraki ders: 18.8 — Watermarking & Provenance.

Yorumlar & Soru-Cevap

(0)
Yorum yazmak için giriş yap.
Yorumlar yükleniyor...

Related Content

Connected pillar topics

Pillar topics this article maps to