
Support ByteLevel encoding in Bpe tokenizer to support DeepSeek model #7425


Merged: 3 commits merged into dotnet:main on Mar 19, 2025

Conversation

@tarekgh (Member) commented Mar 18, 2025

This change includes:

  • Adding support for ByteLevel encoding in the BPE tokenizer.
  • Introducing a new BpeTokenizer.Create method that allows creating a tokenizer using the newly introduced BpeOptions type. This enables users to obtain tokenizer data from various sources (e.g., tokenizer.json files) and instantiate the tokenizer using this new factory method.
  • Expanding test coverage to validate ByteLevel support.
  • Adding a test that loads the real tokenizer.json for the DeepSeek R1 model and verifies its functionality. Since the DeepSeek tokenizer utilizes ByteLevel encoding, it serves as an ideal test case for this feature.
  • Providing test code for loading tokenizer.json, which can be used as a template for loading other models' tokenizers when needed.
  • Introducing the CompositePreTokenizer to support the DeepSeek pre-tokenization scenario.
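ByteLevel encoding, popularized by GPT-2-style tokenizers, remaps every input byte to a printable Unicode character so the BPE merge table never has to deal with raw whitespace or control bytes. As a rough illustration of the idea, here is a Python sketch of the well-known GPT-2 byte-to-unicode table (this mirrors the reference algorithm, not the C# code in this PR):

```python
def bytes_to_unicode():
    """Build the 256-entry byte -> printable-character table used by
    ByteLevel BPE. Printable bytes map to themselves; the rest are
    shifted into the range starting at U+0100 so every byte gets a
    visible, unambiguous character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # remap non-printable bytes past U+00FF
            n += 1
    return dict(zip(bs, map(chr, cs)))
```

Under this mapping a space byte (0x20) becomes "Ġ", which is why ByteLevel vocabularies such as DeepSeek's are full of Ġ-prefixed tokens.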

@Copilot Copilot AI review requested due to automatic review settings March 18, 2025 21:51
@Copilot (Contributor) left a comment

Pull Request Overview

This PR adds support for ByteLevel encoding in the BPE tokenizer to enhance compatibility with the DeepSeek model. Key changes include:

  • Introducing a new BpeOptions type with a ByteLevel property.
  • Adding a CompositePreTokenizer implementation to apply multiple pre-tokenizers sequentially.
  • Refactoring and centralizing byte array appending logic in Helpers, along with updating the ToTokens method in Word to support mapping.
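A composite pre-tokenizer is essentially function composition over text segments: each pre-tokenizer further splits the pieces produced by the previous one. A minimal Python sketch of that chaining (the two splitters here are hypothetical toys, not DeepSeek's actual pre-tokenization rules):

```python
import re
from typing import Callable, List

def composite_pretokenize(text: str,
                          pretokenizers: List[Callable[[str], List[str]]]) -> List[str]:
    # Start with the whole text as one segment, then let each
    # pre-tokenizer further split the segments produced by the previous one.
    segments = [text]
    for pre in pretokenizers:
        segments = [piece for seg in segments for piece in pre(seg)]
    return segments

# Two toy pre-tokenizers: split on whitespace (keeping it), then split out digit runs.
split_ws = lambda s: [t for t in re.split(r"(\s+)", s) if t]
split_digits = lambda s: [t for t in re.split(r"(\d+)", s) if t]
```

A real implementation also has to track each segment's offsets in the original text and leave special tokens unsplit, which is where most of the complexity in CompositePreTokenizer lives.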

Reviewed Changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.

  • src/Microsoft.ML.Tokenizers/Model/BpeOptions.cs: Adds a new options class for BPE tokenizers, including ByteLevel encoding support.
  • src/Microsoft.ML.Tokenizers/PreTokenizer/CompositePreTokenizer.cs: Implements a composite pre-tokenizer with special-token handling.
  • src/Microsoft.ML.Tokenizers/Utils/Helpers.cs: Introduces AppendToBytesArray to centralize byte transformation logic.
  • src/Microsoft.ML.Tokenizers/Model/Word.cs: Updates the ToTokens method to take an optional mapping parameter.
  • src/Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs: Refactors code to use the centralized Helpers.AppendToBytesArray method.
Files not reviewed (1)
  • eng/Versions.props: Language not supported
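For reference, the tokenizer.json format that the PR's test code loads keeps the BPE data under a top-level "model" object with a "vocab" map (token string to id) and an ordered "merges" list. A minimal Python sketch of pulling those fields out (field names follow the HuggingFace tokenizers file format; validation and error handling omitted):

```python
import json

def load_bpe_data(path):
    """Read the vocab and merge list from a HuggingFace-style tokenizer.json."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    model = data["model"]      # the BPE section of the file
    vocab = model["vocab"]     # token string -> id
    # Merges may be stored as "a b" strings or as ["a", "b"] pairs.
    merges = [tuple(m.split(" ")) if isinstance(m, str) else tuple(m)
              for m in model["merges"]]
    return vocab, merges
```

The merge list order matters: during encoding, earlier merges have higher priority.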

@tarekgh tarekgh added this to the ML.NET 5.0 milestone Mar 18, 2025
@tarekgh tarekgh requested a review from michaelgsharp March 18, 2025 21:55
@tarekgh (Member, Author) commented Mar 18, 2025

codecov bot commented Mar 18, 2025

Codecov Report

Attention: Patch coverage is 75.08091% with 231 lines in your changes missing coverage. Please review.

Project coverage is 68.99%. Comparing base (adad40c) to head (39143c0).
Report is 1 commit behind head on main.

Files with missing lines (patch %):
  • src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs: 62.13% (96 missing, 32 partials) ⚠️
  • ...L.Tokenizers/PreTokenizer/CompositePreTokenizer.cs: 40.15% (70 missing, 6 partials) ⚠️
  • test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs: 94.59% (13 missing, 9 partials) ⚠️
  • src/Microsoft.ML.Tokenizers/Model/BpeOptions.cs: 84.21% (2 missing, 1 partial) ⚠️
  • .../Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs: 50.00% (2 missing) ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7425      +/-   ##
==========================================
+ Coverage   68.97%   68.99%   +0.01%     
==========================================
  Files        1481     1483       +2     
  Lines      273708   274563     +855     
  Branches    28285    28395     +110     
==========================================
+ Hits       188789   189431     +642     
- Misses      77525    77694     +169     
- Partials     7394     7438      +44     
Flag coverage:
  • Debug: 68.99% <75.08%> (+6.38%) ⬆️
  • production: 63.26% <59.80%> (+0.65%) ⬆️
  • test: 89.49% <94.59%> (∅)

Flags with carried-forward coverage won't be shown.

Files with missing lines (coverage Δ):
  • src/Microsoft.ML.Tokenizers/Model/Word.cs: 63.80% <100.00%> (+2.38%) ⬆️
  • src/Microsoft.ML.Tokenizers/Utils/Helpers.cs: 72.40% <100.00%> (+1.76%) ⬆️
  • .../Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs: 72.65% <50.00%> (-0.11%) ⬇️
  • src/Microsoft.ML.Tokenizers/Model/BpeOptions.cs: 84.21% <84.21%> (ø)
  • test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs: 97.31% <94.59%> (-2.69%) ⬇️
  • ...L.Tokenizers/PreTokenizer/CompositePreTokenizer.cs: 40.15% <40.15%> (ø)
  • src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs: 71.98% <62.13%> (-5.03%) ⬇️

... and 7 files with indirect coverage changes


@@ -226,6 +287,9 @@ public static async Task<BpeTokenizer> CreateAsync(
/// <param name="continuingSubwordPrefix">The prefix to attach to sub-word units that don’t represent a beginning of word.</param>
/// <param name="endOfWordSuffix">The suffix to attach to sub-word units that represent an end of word.</param>
/// <param name="fuseUnknownTokens">Indicates whether multiple unknown tokens are allowed to get fused.</param>
/// <param name="byteLevel">Indicates whether to handle the input text at the byte level.</param>
/// <param name="beginningOfSentenceToken">The beginning of sentence token.</param>
/// <param name="endOfSentenceToken">The end of sentence token.</param>
private BpeTokenizer(
A contributor commented:

Would it be a good idea to have a constructor here that takes a BpeOptions so you don't have to change the signature every time you have something like this?

@tarekgh (Member, Author) replied Mar 19, 2025:

This would require the existing BpeTokenizer.Create(...) overloads to always build the options object just to call that constructor. That may not be a big deal, but it would take some refactoring work, since we read the vocab and merges differently in different cases.
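The trade-off under discussion is the classic options-object pattern: a single constructor takes an options bag, so new settings become new fields rather than new constructor parameters, but every existing factory overload must first pack its parameter list into that bag. A Python sketch of the shape (the names here are hypothetical stand-ins, not the actual ML.NET types):

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class BpeOptionsSketch:
    vocab: Dict[str, int]
    unknown_token: Optional[str] = None
    fuse_unknown_tokens: bool = False
    byte_level: bool = False  # new settings become new fields, not new parameters

class BpeTokenizerSketch:
    # The single constructor takes only the options bag.
    def __init__(self, options: BpeOptionsSketch):
        self.options = options

    # The legacy factory packs its flat parameter list into an options
    # object before construction; that packing is the refactoring cost.
    @classmethod
    def create(cls, vocab, unknown_token=None,
               fuse_unknown_tokens=False, byte_level=False):
        return cls(BpeOptionsSketch(vocab, unknown_token,
                                    fuse_unknown_tokens, byte_level))
```

The benefit shows up the next time a setting like byte_level is added: the constructor signature stays fixed, and only the options type grows.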

@tarekgh tarekgh merged commit d4f690c into dotnet:main Mar 19, 2025
25 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Apr 19, 2025