quicktok: Fast and Efficient BPE Tokenizer

Overview of quicktok

quicktok is a fast, exact BPE tokenizer written in C++. It is fully compatible with tiktoken and produces byte-identical token IDs. It runs 2 to 3.6 times faster than bpe-openai, the fastest known alternative, and 4 to 11 times faster than tiktoken itself.
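
One way to check the byte-identical claim on your own data is to compare token IDs from two encoders, text by text. The helper below is only a sketch: `first_mismatch` is our own name, and the two stand-in encoders exist solely so the example runs without either library installed. In practice you would pass in tiktoken's `encode` method (for example from `tiktoken.get_encoding("cl100k_base")`) and the corresponding quicktok call, whose Python API is not described in this article.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Encoder = Callable[[str], List[int]]


def first_mismatch(
    reference: Encoder, candidate: Encoder, texts: Iterable[str]
) -> Optional[Tuple[str, List[int], List[int]]]:
    """Return (text, reference_ids, candidate_ids) for the first disagreement, or None."""
    for text in texts:
        ref_ids, cand_ids = reference(text), candidate(text)
        if ref_ids != cand_ids:
            return text, ref_ids, cand_ids
    return None


# Stand-in encoders (hypothetical): both map text to its UTF-8 byte values.
def reference_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def candidate_encode(text: str) -> List[int]:
    return [b for b in text.encode("utf-8")]


samples = ["hello world", "def f(x):\n    return x", "naïve café ☕"]
print(first_mismatch(reference_encode, candidate_encode, samples))  # None
```

Returning the first disagreement, rather than a plain boolean, makes it easier to see which input triggered a difference.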

Key Features

quicktok supports the cl100k, o200k, GPT-OSS, Llama-3, and Qwen2.5/3 encodings. It uses the same algorithm as bpe-openai, exact backtracking BPE, but reworks the data structures to minimize memory access:

2-byte Trie: Used for efficient longest-match walks.
Dense Caches: Used to validate merges quickly.
Hand-Compiled Pretokenizer: Pretokenization is hand-compiled into dedicated code instead of running through a general regex engine.
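
To make the longest-match walk concrete, here is a minimal Python sketch of a byte-level trie lookup. It is an illustration under simplifying assumptions, not quicktok's implementation: it uses a plain one-byte-per-edge dictionary trie rather than the 2-byte layout mentioned above, and the toy vocabulary and its IDs are made up.

```python
from typing import Dict, Optional, Tuple


class TrieNode:
    """One node per byte prefix; token_id is set when the prefix is a whole token."""

    __slots__ = ("children", "token_id")

    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}
        self.token_id: Optional[int] = None


def build_trie(vocab: Dict[bytes, int]) -> TrieNode:
    """Insert every vocabulary token into the trie, byte by byte."""
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for byte in token:
            node = node.children.setdefault(byte, TrieNode())
        node.token_id = token_id
    return root


def longest_match(root: TrieNode, data: bytes, start: int) -> Tuple[Optional[int], int]:
    """Return (token_id, length) of the longest vocabulary token at data[start:]."""
    node = root
    best_id, best_len = None, 0
    for offset in range(start, len(data)):
        node = node.children.get(data[offset])
        if node is None:
            break  # no vocabulary token continues with this byte
        if node.token_id is not None:
            best_id, best_len = node.token_id, offset - start + 1
    return best_id, best_len


# Toy vocabulary (hypothetical IDs, not taken from any real encoding).
TOY_VOCAB = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4, b"bc": 5}
trie = build_trie(TOY_VOCAB)
print(longest_match(trie, b"abcab", 0))  # (4, 3): "abc" is the longest token at offset 0
print(longest_match(trie, b"abcab", 3))  # (3, 2): "ab" is the longest token at offset 3
```

Longest match alone does not reproduce BPE output. In the backtracking approach named above, the greedy choice still has to be validated against the merge rules and undone when it is not a valid BPE segmentation, which is the step the merge-validation caches are described as speeding up.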

Performance Benchmarks

Benchmarks were run on an Apple M1 using a single thread. Throughput is reported in MB/s (higher is better) for three datasets:

Encoder	The Pile	Code	Common Crawl
quicktok (native)	121.7	139.2	71.3
quicktok (Python)	77.9	83.6	49.7
bpe-openai	36.6	38.7	28.9
rs-bpe	30.9	34.7	23.5
tiktoken-rs	15.4	13.8	13.3
tiktoken (Python)	13.6	12.8	12.3
TokenDagger	11.1	11.9	10.7
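
The speedup ranges quoted in the overview can be checked against the table. The short Python sketch below divides each quicktok row by a baseline row; the figures are copied from the table, and the helper name `speedups` is ours.

```python
# Throughput in MB/s, copied from the benchmark table (Apple M1, single thread).
DATASETS = ("The Pile", "Code", "Common Crawl")
THROUGHPUT = {
    "quicktok (native)": (121.7, 139.2, 71.3),
    "quicktok (Python)": (77.9, 83.6, 49.7),
    "bpe-openai": (36.6, 38.7, 28.9),
    "tiktoken (Python)": (13.6, 12.8, 12.3),
}


def speedups(fast: str, baseline: str) -> dict:
    """Per-dataset ratio of two table rows, rounded to one decimal place."""
    return {
        name: round(f / b, 1)
        for name, f, b in zip(DATASETS, THROUGHPUT[fast], THROUGHPUT[baseline])
    }


for fast in ("quicktok (native)", "quicktok (Python)"):
    for baseline in ("bpe-openai", "tiktoken (Python)"):
        print(f"{fast} vs {baseline}: {speedups(fast, baseline)}")
```

With these figures the native build comes out at roughly 2.5x to 3.6x bpe-openai and 5.8x to 10.9x tiktoken, while the Python binding lands at roughly 1.7x to 2.2x and 4.0x to 6.5x respectively.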

Each encoder is called through its own raw API. To reproduce the results, run make bench-compare in the repository.
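
For a rough measurement outside the repository's own harness, single-thread throughput can be estimated by timing an encode callable over a text corpus. The sketch below is not the project's benchmark code: `measure_mb_per_s` and the stand-in encoder are hypothetical, 1 MB is assumed to mean 1e6 bytes, and a real encoder would be passed in through its own API.

```python
import time
from typing import Callable, List


def measure_mb_per_s(encode: Callable[[str], List[int]], text: str, repeats: int = 5) -> float:
    """Best-of-N single-thread throughput in MB/s (assuming 1 MB = 1e6 bytes of UTF-8 input)."""
    n_bytes = len(text.encode("utf-8"))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(text)
        best = min(best, time.perf_counter() - start)
    return n_bytes / best / 1e6


def byte_encoder(text: str) -> List[int]:
    """Stand-in encoder for the demo: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


sample = "The quick brown fox jumps over the lazy dog. " * 20_000
print(f"{measure_mb_per_s(byte_encoder, sample):.1f} MB/s")
```

Taking the best of several runs reduces noise from other processes. Absolute numbers will differ by machine and corpus, so only ratios measured on the same setup are comparable.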

Conclusion

quicktok is worth a look for anyone who needs faster tokenization with tiktoken-compatible output. Install it with pip install quicktok-v1; the source is available on GitHub (quicktok Repo).

This article was prepared by the AI editorial team and reviewed by an editor.
