Support ByteLevel encoding in Bpe tokenizer to support DeepSeek model #7425
Pull Request Overview
This PR adds support for ByteLevel encoding in the BPE tokenizer to enhance compatibility with the DeepSeek model. Key changes include:
- Introducing a new BpeOptions type with a ByteLevel property.
- Adding a CompositePreTokenizer implementation to apply multiple pre-tokenizers sequentially.
- Refactoring and centralizing byte array appending logic in Helpers, along with updating the ToTokens method in Word to support mapping.
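As a rough illustration of what byte-level handling means for a BPE tokenizer, the sketch below uses the GPT-2 style byte-to-character table that byte-level BPE vocabularies commonly rely on. It is an assumption that this PR uses exactly this table; the class and method names (ByteLevelSketch, BuildByteToChar, ToByteLevel, FromByteLevel) are hypothetical and are not part of Microsoft.ML.Tokenizers.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative sketch only; hypothetical names, not the code from this PR.
public static class ByteLevelSketch
{
    // Builds a GPT-2 style table mapping each of the 256 byte values to a printable char.
    // Bytes that are already printable map to themselves; the rest map to U+0100 and up.
    public static char[] BuildByteToChar()
    {
        var table = new char[256];
        int next = 0; // counts the bytes that need a substitute character
        for (int b = 0; b < 256; b++)
        {
            bool printable = (b >= '!' && b <= '~')
                          || (b >= 0xA1 && b <= 0xAC)
                          || (b >= 0xAE && b <= 0xFF);
            table[b] = printable ? (char)b : (char)(256 + next++);
        }
        return table;
    }

    // Encodes the text as UTF-8 and replaces every byte with its mapped character.
    public static string ToByteLevel(string text)
    {
        char[] table = BuildByteToChar();
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        var sb = new StringBuilder(bytes.Length);
        foreach (byte b in bytes)
        {
            sb.Append(table[b]);
        }
        return sb.ToString();
    }

    // Reverses ToByteLevel: maps each character back to its byte and decodes the UTF-8.
    public static string FromByteLevel(string mapped)
    {
        char[] table = BuildByteToChar();
        var reverse = new Dictionary<char, byte>(256);
        for (int b = 0; b < 256; b++)
        {
            reverse[table[b]] = (byte)b;
        }
        var bytes = new byte[mapped.Length];
        for (int i = 0; i < mapped.Length; i++)
        {
            bytes[i] = reverse[mapped[i]];
        }
        return Encoding.UTF8.GetString(bytes);
    }
}
```

With this table a space (byte 0x20) becomes U+0120, which is why vocabularies of byte-level models typically contain entries that start with that character. BPE merges then run on the mapped string, and decoding maps the characters back to bytes.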
Reviewed Changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
src/Microsoft.ML.Tokenizers/Model/BpeOptions.cs | Adds a new options class for BPE tokenizers, including ByteLevel encoding support. |
src/Microsoft.ML.Tokenizers/PreTokenizer/CompositePreTokenizer.cs | Implements a composite pre-tokenizer with special tokens handling. |
src/Microsoft.ML.Tokenizers/Utils/Helpers.cs | Introduces AppendToBytesArray to centralize byte transformation logic. |
src/Microsoft.ML.Tokenizers/Model/Word.cs | Updates the ToTokens method to incorporate an optional mapping parameter. |
src/Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs | Refactors code to use the centralized Helpers.AppendToBytesArray method. |
Files not reviewed (1)
- eng/Versions.props: Language not supported
Co-authored-by: Copilot <[email protected]>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #7425 +/- ##
==========================================
+ Coverage 68.97% 68.99% +0.01%
==========================================
Files 1481 1483 +2
Lines 273708 274563 +855
Branches 28285 28395 +110
==========================================
+ Hits 188789 189431 +642
- Misses 77525 77694 +169
- Partials 7394 7438 +44
@@ -226,6 +287,9 @@ public static async Task<BpeTokenizer> CreateAsync(
/// <param name="continuingSubwordPrefix">The prefix to attach to sub-word units that don’t represent a beginning of word.</param>
/// <param name="endOfWordSuffix">The suffix to attach to sub-word units that represent an end of word.</param>
/// <param name="fuseUnknownTokens">Indicate whether allowing multiple unknown tokens get fused.</param>
/// <param name="byteLevel">Indicate whether to handle the input text in byte level.</param>
/// <param name="beginningOfSentenceToken">The beginning of sentence token.</param>
/// <param name="endOfSentenceToken">The end of sentence token.</param>
private BpeTokenizer(
Would it be a good idea to have a constructor here that takes a BpeOptions, so you don't have to change the signature every time you add something like this?
That would require the existing BpeTokenizer.Create(..., ..., ...,..) to always create the options object in order to call that constructor. It may not be a big deal, but it will take some refactoring, because we read the vocabulary and merges differently in different cases.
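To make the trade-off in this thread concrete, here is a minimal sketch of the options-object pattern being discussed. SketchBpeOptions and SketchBpeTokenizer are hypothetical stand-ins, not the library's types; the property names simply mirror the parameters in the diff above, and the real BpeOptions may differ.

```csharp
#nullable enable
using System;

// Hypothetical options type for illustration; not the library's BpeOptions.
public sealed class SketchBpeOptions
{
    public string? ContinuingSubwordPrefix { get; set; }
    public string? EndOfWordSuffix { get; set; }
    public bool FuseUnknownTokens { get; set; }
    public bool ByteLevel { get; set; }
    public string? BeginningOfSentenceToken { get; set; }
    public string? EndOfSentenceToken { get; set; }
}

// Hypothetical tokenizer shell; it only stores the options to show the wiring.
public sealed class SketchBpeTokenizer
{
    public SketchBpeOptions Options { get; }

    // Single constructor: a new setting becomes a new property, not a new parameter.
    private SketchBpeTokenizer(SketchBpeOptions options)
    {
        Options = options ?? throw new ArgumentNullException(nameof(options));
    }

    // Options-based factory.
    public static SketchBpeTokenizer Create(SketchBpeOptions options)
        => new SketchBpeTokenizer(options);

    // Existing-style overload kept for compatibility; it packs its arguments
    // into an options object and forwards to the factory above.
    public static SketchBpeTokenizer Create(
        string? continuingSubwordPrefix,
        string? endOfWordSuffix,
        bool fuseUnknownTokens)
        => Create(new SketchBpeOptions
        {
            ContinuingSubwordPrefix = continuingSubwordPrefix,
            EndOfWordSuffix = endOfWordSuffix,
            FuseUnknownTokens = fuseUnknownTokens,
        });
}
```

The sketch leaves out the part the reply points to as the real cost: each existing overload reads the vocabulary and merges in its own way, so every one of them would need to build the options object before reaching the shared constructor.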
This change includes:
- A new BpeTokenizer.Create method that allows creating a tokenizer using the newly introduced BpeOptions type. This enables users to obtain tokenizer data from various sources (e.g., tokenizer.json files) and instantiate the tokenizer using this new factory method.
- [...] tokenizer.json, which can be used as a template for loading other models' tokenizers when needed.