Goal: Enable .NET developers to use tokenizers in their data pre-processing pipelines as part of their embedding and token generation tasks using language models.
Committed:
- Add support for more commonly used Tokenizers
  - Tiktoken: Introducing Tiktoken Tokenizer (#6981)
  - LlamaTokenizer & SentencePiece algorithm: [Tokenizers] Port LLaMA Tokenizer and SentencePiece algorithm (#6987)
  - CodeGenTokenizer & byte-level BPE: [Tokenizers] Port CodeGenTokenizer & byte-level BPE algorithm (#6992)
  - WordPiece algorithm: [Tokenizers] Implement WordPiece algorithm (#6988)
  - BERTTokenizer: [Tokenizers] Port BERTTokenizers (#6991)
- Measure and improve the performance of the Tokenizers API, making breaking changes where necessary (Track Tokenizers design feedback, #6982).
- Explore existing construction patterns to improve usability, both in the factory API and when loading from configuration.
- Drive adoption of Microsoft.ML.Tokenizers in other libraries
- Docs and samples
Backlog:
- Investigate using Microsoft.ML.Tokenizers in Azure OpenAI SDK
- SentencePiece Unigram: Implement Sentencepiece Unigram tokenizer (#7186)
- CLIP Tokenizer: [Tokenizers] Port CLIP Tokenizer (#6993)