
Byte Pair Encoding

A tokenization method that builds a data-driven subword vocabulary by learning frequent subunit merges.

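Byte Pair Encoding (BPE) is one of the most widely known subword tokenization methods. It starts from small units, typically individual characters or bytes, and repeatedly merges the most frequent adjacent pair of units into a single new unit until a target number of merges or vocabulary size is reached. Frequent words and word pieces end up as single tokens, while rare or unseen words can still be decomposed into smaller known pieces instead of being mapped to an unknown token. This gives a practical balance between vocabulary size and sequence length, which is especially useful in generative language models and open-ended text systems.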
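
The merge procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function names (`merge_pair`, `learn_bpe`, `encode`) are hypothetical, the input is assumed to be a plain list of words split into characters, and ties between equally frequent pairs are broken by tuple ordering, which is an arbitrary choice made here only for determinism.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each distinct word is stored as a tuple of symbols with its corpus frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties fall back to tuple ordering (an assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

With a toy corpus of "low" (5 times), "lower" (2), "newest" (6), and "widest" (3), four merges under this sketch are `('s', 't')`, `('e', 'st')`, `('o', 'w')`, and `('l', 'ow')`. The word "lowest", which never appears in the corpus, is then encoded as `['low', 'est']`, showing how an unseen word remains decomposable into learned pieces.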