# Dataset Quality Pipeline: MinHash Dedupe + Perplexity Filter + Toxicity + Educational-Value

> Source: https://sukruyusufkaya.com/en/learn/fine-tuning-cookbook/ftc-dataset-quality-pipeline-minhash-perplexity
> Updated: 2026-05-14T14:42:50.627Z
> Category: Fine-Tuning Cookbook (Model-by-Model)
> Module: Part II — Tokenizer & Data Engineering

**TLDR:** Garbage in, garbage out. This SFT dataset quality pipeline chains four stages: MinHash LSH near-duplicate removal (roughly 30-40% of rows are duplicates), a KenLM 5-gram perplexity filter, HateBERT-TR toxicity screening, and a FineWeb-style educational-value scorer. It cleans a 1M-row Turkish (TR) dataset in 25 min on an RTX 4090.
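
To make the first stage concrete, here is a minimal, standard-library-only sketch of MinHash signatures with LSH banding. It is an illustration under stated assumptions, not the pipeline's actual implementation: the function names (`shingles`, `minhash`, `similarity`, `dedupe`), the 64-value signature, the 3-word shingles, the 16 bands, and the 0.8 similarity threshold are all hypothetical choices made for this example.

```python
import hashlib
import re

NUM_PERM = 64  # assumed signature length (number of salted hash functions)


def shingles(text, n=3):
    """Return the set of word n-grams; n=3 is an assumed setting."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash(text, num_perm=NUM_PERM):
    """Signature = minimum 64-bit hash of the shingle set under each salt."""
    grams = [g.encode("utf-8") for g in shingles(text)]
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "little")
            for g in grams))
    return sig


def similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedupe(texts, threshold=0.8, bands=16):
    """Return indices of rows to keep; later near-duplicates are dropped.

    LSH banding: a row is only compared against earlier kept rows that
    share at least one identical band of signature values.
    """
    rows = NUM_PERM // bands
    buckets, kept_sigs, kept = {}, {}, []
    for i, text in enumerate(texts):
        sig = minhash(text)
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]
        candidates = {j for k in keys for j in buckets.get(k, ())}
        if any(similarity(sig, kept_sigs[j]) >= threshold for j in candidates):
            continue  # near-duplicate of a row we already kept
        kept.append(i)
        kept_sigs[i] = sig
        for k in keys:
            buckets.setdefault(k, []).append(i)
    return kept
```

Calling `dedupe(rows)` on a list of strings returns the indices of the rows that survive. At the 1M-row scale mentioned above, a tuned library implementation would normally replace this pure-Python loop; the sketch only shows how signatures and bands interact.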

