
Support ByteLevel encoding in Bpe tokenizer to support DeepSeek model #7425


Merged: 3 commits merged into dotnet:main on Mar 19, 2025

Conversation

@tarekgh (Member) commented Mar 18, 2025

This change includes:

  • Adding support for ByteLevel encoding in the BPE tokenizer.
  • Introducing a new BpeTokenizer.Create method that allows creating a tokenizer using the newly introduced BpeOptions type. This enables users to obtain tokenizer data from various sources (e.g., tokenizer.json files) and instantiate the tokenizer using this new factory method.
  • Expanding test coverage to validate ByteLevel support.
  • Adding a test that loads the real tokenizer.json for the DeepSeek R1 model and verifies its functionality. Since the DeepSeek tokenizer utilizes ByteLevel encoding, it serves as an ideal test case for this feature.
  • Providing test code for loading tokenizer.json, which can be used as a template for loading other models' tokenizers when needed.
  • Introducing the CompositePreTokenizer to support the DeepSeek pre-tokenization scenario.
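ByteLevel encoding, popularized by GPT-2-style tokenizers, remaps every input byte to a printable Unicode character so the BPE merge table never has to deal with raw whitespace or control bytes. As a rough illustration of the idea, here is a Python sketch of the well-known GPT-2 byte-to-unicode table (this mirrors the reference algorithm, not the C# code in this PR):

```python
def bytes_to_unicode():
    """Build the 256-entry byte -> printable-character table used by
    ByteLevel BPE. Printable bytes map to themselves; the rest are
    shifted into the range starting at U+0100 so every byte gets a
    visible, unambiguous character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # remap non-printable bytes past U+00FF
            n += 1
    return dict(zip(bs, map(chr, cs)))
```

Under this mapping a space byte (0x20) becomes "Ġ", which is why ByteLevel vocabularies such as DeepSeek's are full of Ġ-prefixed tokens.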

@Copilot Copilot AI review requested due to automatic review settings March 18, 2025 21:51
@Copilot (Contributor) left a comment

Pull Request Overview

This PR adds support for ByteLevel encoding in the BPE tokenizer to enhance compatibility with the DeepSeek model. Key changes include:

  • Introducing a new BpeOptions type with a ByteLevel property.
  • Adding a CompositePreTokenizer implementation to apply multiple pre-tokenizers sequentially.
  • Refactoring and centralizing byte array appending logic in Helpers, along with updating the ToTokens method in Word to support mapping.
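A composite pre-tokenizer is essentially function composition over text segments: each pre-tokenizer further splits the pieces produced by the previous one. A minimal Python sketch of that chaining (the two splitters here are hypothetical toys, not DeepSeek's actual pre-tokenization rules):

```python
import re
from typing import Callable, List

def composite_pretokenize(text: str,
                          pretokenizers: List[Callable[[str], List[str]]]) -> List[str]:
    # Start with the whole text as one segment, then let each
    # pre-tokenizer further split the segments produced by the previous one.
    segments = [text]
    for pre in pretokenizers:
        segments = [piece for seg in segments for piece in pre(seg)]
    return segments

# Two toy pre-tokenizers: split on whitespace (keeping it), then split out digit runs.
split_ws = lambda s: [t for t in re.split(r"(\s+)", s) if t]
split_digits = lambda s: [t for t in re.split(r"(\d+)", s) if t]
```

A real implementation also has to track each segment's offsets in the original text and leave special tokens unsplit, which is where most of the complexity in CompositePreTokenizer lives.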

Reviewed Changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.

  • src/Microsoft.ML.Tokenizers/Model/BpeOptions.cs: Adds a new options class for BPE tokenizers, including ByteLevel encoding support.
  • src/Microsoft.ML.Tokenizers/PreTokenizer/CompositePreTokenizer.cs: Implements a composite pre-tokenizer with special-token handling.
  • src/Microsoft.ML.Tokenizers/Utils/Helpers.cs: Introduces AppendToBytesArray to centralize byte transformation logic.
  • src/Microsoft.ML.Tokenizers/Model/Word.cs: Updates the ToTokens method to take an optional mapping parameter.
  • src/Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs: Refactors code to use the centralized Helpers.AppendToBytesArray method.
Files not reviewed (1)
  • eng/Versions.props: Language not supported
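For reference, the tokenizer.json format that the PR's test code loads keeps the BPE data under a top-level "model" object with a "vocab" map (token string to id) and an ordered "merges" list. A minimal Python sketch of pulling those fields out (field names follow the HuggingFace tokenizers file format; validation and error handling omitted):

```python
import json

def load_bpe_data(path):
    """Read the vocab and merge list from a HuggingFace-style tokenizer.json."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    model = data["model"]      # the BPE section of the file
    vocab = model["vocab"]     # token string -> id
    # Merges may be stored as "a b" strings or as ["a", "b"] pairs.
    merges = [tuple(m.split(" ")) if isinstance(m, str) else tuple(m)
              for m in model["merges"]]
    return vocab, merges
```

The merge list order matters: during encoding, earlier merges have higher priority.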

@tarekgh tarekgh added this to the ML.NET 5.0 milestone Mar 18, 2025
@tarekgh tarekgh requested a review from michaelgsharp March 18, 2025 21:55
@tarekgh (Member, Author) commented Mar 18, 2025

codecov bot commented Mar 18, 2025

Codecov Report

Attention: Patch coverage is 75.08091% with 231 lines in your changes missing coverage. Please review.

Project coverage is 68.99%. Comparing base (adad40c) to head (39143c0).
Report is 1 commit behind head on main.

Files with missing lines (patch %):
  • src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs: 62.13% (96 missing, 32 partials) ⚠️
  • ...L.Tokenizers/PreTokenizer/CompositePreTokenizer.cs: 40.15% (70 missing, 6 partials) ⚠️
  • test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs: 94.59% (13 missing, 9 partials) ⚠️
  • src/Microsoft.ML.Tokenizers/Model/BpeOptions.cs: 84.21% (2 missing, 1 partial) ⚠️
  • .../Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs: 50.00% (2 missing) ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7425      +/-   ##
==========================================
+ Coverage   68.97%   68.99%   +0.01%     
==========================================
  Files        1481     1483       +2     
  Lines      273708   274563     +855     
  Branches    28285    28395     +110     
==========================================
+ Hits       188789   189431     +642     
- Misses      77525    77694     +169     
- Partials     7394     7438      +44     
Flag coverage:
  • Debug: 68.99% <75.08%> (+6.38%) ⬆️
  • production: 63.26% <59.80%> (+0.65%) ⬆️
  • test: 89.49% <94.59%> (∅)

Flags with carried-forward coverage won't be shown.

Files with missing lines (coverage Δ):
  • src/Microsoft.ML.Tokenizers/Model/Word.cs: 63.80% <100.00%> (+2.38%) ⬆️
  • src/Microsoft.ML.Tokenizers/Utils/Helpers.cs: 72.40% <100.00%> (+1.76%) ⬆️
  • .../Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs: 72.65% <50.00%> (-0.11%) ⬇️
  • src/Microsoft.ML.Tokenizers/Model/BpeOptions.cs: 84.21% <84.21%> (ø)
  • test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs: 97.31% <94.59%> (-2.69%) ⬇️
  • ...L.Tokenizers/PreTokenizer/CompositePreTokenizer.cs: 40.15% <40.15%> (ø)
  • src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs: 71.98% <62.13%> (-5.03%) ⬇️

... and 7 files with indirect coverage changes


@@ -226,6 +287,9 @@ public static async Task<BpeTokenizer> CreateAsync(
/// <param name="continuingSubwordPrefix">The prefix to attach to sub-word units that don’t represent a beginning of word.</param>
/// <param name="endOfWordSuffix">The suffix to attach to sub-word units that represent an end of word.</param>
/// <param name="fuseUnknownTokens">Indicates whether multiple unknown tokens are allowed to get fused.</param>
/// <param name="byteLevel">Indicates whether to handle the input text at the byte level.</param>
/// <param name="beginningOfSentenceToken">The beginning of sentence token.</param>
/// <param name="endOfSentenceToken">The end of sentence token.</param>
private BpeTokenizer(
A contributor commented:

Would it be a good idea to have a constructor here that takes a BpeOptions so you don't have to change the signature every time you have something like this?

@tarekgh (Member, Author) replied Mar 19, 2025:

This would require the existing BpeTokenizer.Create(...) overloads to always build the options object just to call that constructor. That may not be a big deal, but it would take some refactoring work, since we read the vocab and merges differently in different cases.
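The trade-off under discussion is the classic options-object pattern: a single constructor takes an options bag, so new settings become new fields rather than new constructor parameters, but every existing factory overload must first pack its parameter list into that bag. A Python sketch of the shape (the names here are hypothetical stand-ins, not the actual ML.NET types):

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class BpeOptionsSketch:
    vocab: Dict[str, int]
    unknown_token: Optional[str] = None
    fuse_unknown_tokens: bool = False
    byte_level: bool = False  # new settings become new fields, not new parameters

class BpeTokenizerSketch:
    # The single constructor takes only the options bag.
    def __init__(self, options: BpeOptionsSketch):
        self.options = options

    # The legacy factory packs its flat parameter list into an options
    # object before construction; that packing is the refactoring cost.
    @classmethod
    def create(cls, vocab, unknown_token=None,
               fuse_unknown_tokens=False, byte_level=False):
        return cls(BpeOptionsSketch(vocab, unknown_token,
                                    fuse_unknown_tokens, byte_level))
```

The benefit shows up the next time a setting like byte_level is added: the constructor signature stays fixed, and only the options type grows.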

@tarekgh tarekgh merged commit d4f690c into dotnet:main Mar 19, 2025
25 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Apr 19, 2025