Long-CoT Stability: Repetition Collapse + Think-Loop Mitigation

The most common failure mode of reasoning models is the **think-loop**: the model keeps restating the same thought. It shows up as repetition collapse and length explosion (think traces growing from 8K to 30K tokens). Mitigations: an entropy bonus and a repetition penalty during training, max_think_tokens enforcement, reward shaping (a length penalty), and early-stopping heuristics.

Şükrü Yusuf KAYA

24 min read

5/14/2026

Advanced


1. Think-Loop Patterns

<think>
To solve the problem... wait a second, let me think about this first...
Actually, let me review my previous thought...
Actually, let me re-evaluate the situation...
Let me think about the situation again...
[3000 tokens later]
Let me think about the situation again...
</think>

Detection: n-gram repetition rate. Any 4-gram that repeats 5 or more times signals loop risk.
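The detection rule above can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation: the function names are hypothetical, and defining the repetition rate as the share of n-gram positions that duplicate an earlier n-gram is one reasonable assumption among several.

```python
from collections import Counter


def ngram_counts(tokens, n=4):
    """Count every contiguous n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_repetition_rate(tokens, n=4):
    """Share of n-gram positions that duplicate an earlier n-gram (assumed definition)."""
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # sequence shorter than n tokens
    return 1.0 - len(counts) / total


def has_loop_risk(tokens, n=4, min_repeats=5):
    """Flag the sequence when any single n-gram occurs at least min_repeats times."""
    return any(c >= min_repeats for c in ngram_counts(tokens, n).values())
```

Tokens can be token ids or strings; anything hashable works. The thresholds `n=4` and `min_repeats=5` mirror the rule stated above.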

Training-time mitigation:

Length penalty added to the GRPO reward: -0.001 × max(0, len - 2000)
Repetition penalty: -0.5 × n_gram_repetition_rate
Entropy bonus: +0.01 × entropy(action_distribution)
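As a rough sketch, the three terms can be combined into one shaped reward per sampled completion. The coefficients are the ones listed above; the function names, the argument layout, and folding the entropy bonus into the reward are assumptions made for illustration (entropy bonuses are often applied in the loss instead).

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def shaped_reward(task_reward, think_len, repetition_rate, mean_entropy,
                  len_coef=0.001, len_budget=2000, rep_coef=0.5, ent_coef=0.01):
    """Task reward plus length penalty, repetition penalty, and entropy bonus."""
    # Only tokens beyond the budget are penalized.
    length_penalty = -len_coef * max(0, think_len - len_budget)
    # repetition_rate is expected in [0, 1].
    repetition_penalty = -rep_coef * repetition_rate
    # mean_entropy: average per-token entropy over the completion (assumed input).
    entropy_bonus = ent_coef * mean_entropy
    return task_reward + length_penalty + repetition_penalty + entropy_bonus
```

With these coefficients, a correct answer (task reward 1.0) that spends 3000 think tokens already loses its entire task reward to the length penalty, so the length coefficient should be tuned against the scale of the task reward.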

Inference-time mitigation:

max_new_tokens=8192 (hard cap)
repetition_penalty=1.1 (HF generation parameter)
Custom logits processor: on 4-gram repetition, boost the </think> token
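The custom logits processor idea can be sketched framework-free, as below. The `__call__(input_ids, scores)` shape mirrors the Hugging Face `LogitsProcessor` interface, but this version works on plain Python lists so it runs without torch; the class name, the `boost=5.0` value, and the "count occurrences of the last 4-gram" trigger are assumptions, and porting it to tensors is left out.

```python
class ThinkCloseBoost:
    """Add a bonus to the </think> logit once the latest n-gram has repeated too often."""

    def __init__(self, close_token_id, n=4, min_repeats=5, boost=5.0):
        self.close_token_id = close_token_id  # id of the </think> token (tokenizer-specific)
        self.n = n
        self.min_repeats = min_repeats
        self.boost = boost  # hypothetical value; tune per model

    def _tail_repeats(self, ids):
        """Number of times the final n-gram of ids occurs anywhere in ids."""
        if len(ids) < self.n:
            return 0
        tail = tuple(ids[-self.n:])
        return sum(
            1 for i in range(len(ids) - self.n + 1)
            if tuple(ids[i:i + self.n]) == tail
        )

    def __call__(self, input_ids, scores):
        """input_ids: batch of token-id lists; scores: batch of next-token logit lists."""
        out = [list(row) for row in scores]  # copy, leave the caller's scores untouched
        for b, ids in enumerate(input_ids):
            if self._tail_repeats(ids) >= self.min_repeats:
                out[b][self.close_token_id] += self.boost
        return out
```

A real deployment would also need to check that generation is still inside the think block before boosting, which this sketch does not do.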

✅ Deliverable

1) Run the trained reasoning model on 100 problems at inference and compute the think-length distribution. 2) Inspect the outliers (>10K think tokens). 3) Next lesson: 12.6, Reasoning Eval.
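One possible way to produce the distribution and the outlier list is sketched below. It assumes the think lengths have already been measured (for example by tokenizing the text between `<think>` and `</think>` for each problem); the function name and the report fields are hypothetical.

```python
import statistics


def think_length_report(lengths, outlier_threshold=10_000):
    """Summarize think-token counts and list the indices of outlier problems."""
    if len(lengths) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    # 99 cut points: cuts[k - 1] is the k-th percentile.
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")
    outliers = [i for i, n in enumerate(lengths) if n > outlier_threshold]
    return {
        "count": len(lengths),
        "mean": statistics.fmean(lengths),
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
        "max": max(lengths),
        "outlier_indices": outliers,
    }
```

The returned `outlier_indices` point back into the problem list, so the corresponding traces can be opened and checked for the loop patterns from section 1.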
