Skip to content

Introducing SentencePiece Unigram Tokenizer Model #7390

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

tarekgh
Copy link
Member

@tarekgh tarekgh commented Feb 18, 2025

Fixes #7186

We have been supporting SentencePiece Bpe model for a while, the change here is introducing the support to the SentencePiece Unigram tokenizer model.

Users can create a tokenizer instance by using a code like the following:

// download the tokenizer.model from the LLM model site like the following:            
// @"https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/resolve/main/sentencepiece.bpe.model?download=true";
using Stream remoteStream = File.OpenRead(path/to/the/tokenizer/model/file);
SentencePieceTokenizer tokenizer = SentencePieceTokenizer.Create(remoteStream);

Notes around the change

  • Have done some refactoring code existed in the SentencePieceToknizer.cs file into a newly introduced file SentencePieceBaseModel.cs and SentencePieceBpeModel.cs. Most of the code in the newly introduced files are mostly not changed.
  • Added the new Unigram implementation into the SentencePieceUnigramModel.cs file.
  • Changed SentencePieceTokenizer.cs to work with the model abstraction and automatically handle both models Bpe and Unigram.
  • The change shouldn't introduce any behavior change to the existing model (i.e Bpe).

@Copilot Copilot AI review requested due to automatic review settings February 18, 2025 20:10
@tarekgh
Copy link
Member Author

tarekgh commented Feb 18, 2025

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 11 out of 14 changed files in this pull request and generated 1 comment.

Files not reviewed (3)
  • THIRD-PARTY-NOTICES.TXT: Language not supported
  • eng/Versions.props: Language not supported
  • src/Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs: Evaluated as low risk

@tarekgh tarekgh added this to the ML.NET 5.0 milestone Feb 18, 2025
Copy link

codecov bot commented Feb 18, 2025

Codecov Report

Attention: Patch coverage is 75.07347% with 933 lines in your changes missing coverage. Please review.

Project coverage is 68.94%. Comparing base (e3219a9) to head (d244042).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
...t.ML.Tokenizers/Model/SentencePieceUnigramModel.cs 60.18% 293 Missing and 47 partials ⚠️
...osoft.ML.Tokenizers/Model/SentencePieceBpeModel.cs 77.43% 158 Missing and 32 partials ⚠️
...c/Microsoft.ML.Tokenizers/Utils/DoubleArrayTrie.cs 82.66% 107 Missing and 19 partials ⚠️
...soft.ML.Tokenizers/Model/SentencePieceBaseModel.cs 79.04% 73 Missing and 32 partials ⚠️
...L.Tokenizers/Normalizer/SentencePieceNormalizer.cs 56.77% 61 Missing and 22 partials ⚠️
src/Microsoft.ML.Tokenizers/Utils/Helpers.cs 37.77% 56 Missing ⚠️
...t.ML.Tokenizers/Utils/OrdinalUtf8StringComparer.cs 59.61% 16 Missing and 5 partials ⚠️
test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs 97.10% 6 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7390      +/-   ##
==========================================
+ Coverage   68.88%   68.94%   +0.06%     
==========================================
  Files        1473     1480       +7     
  Lines      270814   273273    +2459     
  Branches    27884    28234     +350     
==========================================
+ Hits       186553   188419    +1866     
- Misses      76985    77490     +505     
- Partials     7276     7364      +88     
Flag Coverage Δ
Debug 68.94% <75.07%> (+0.06%) ⬆️
production 63.26% <71.73%> (+0.06%) ⬆️
test 89.46% <97.52%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
.../Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs 72.76% <ø> (ø)
...rc/Microsoft.ML.Tokenizers/Model/LlamaTokenizer.cs 43.75% <ø> (-15.35%) ⬇️
...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs 90.00% <ø> (+13.13%) ⬆️
...icrosoft.ML.Tokenizers/Utils/Helpers.netcoreapp.cs 39.47% <100.00%> (+1.63%) ⬆️
...crosoft.ML.Tokenizers.Tests/DoubleArrayTrieTest.cs 100.00% <100.00%> (ø)
test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs 97.10% <97.10%> (ø)
...t.ML.Tokenizers/Utils/OrdinalUtf8StringComparer.cs 59.61% <59.61%> (ø)
src/Microsoft.ML.Tokenizers/Utils/Helpers.cs 70.63% <37.77%> (-20.46%) ⬇️
...L.Tokenizers/Normalizer/SentencePieceNormalizer.cs 70.32% <56.77%> (-17.95%) ⬇️
...soft.ML.Tokenizers/Model/SentencePieceBaseModel.cs 79.04% <79.04%> (ø)
... and 3 more

... and 11 files with indirect coverage changes

Copy link
Member

@ericstj ericstj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't dig too deep just a few small observations and suggestions.

@ericstj ericstj requested a review from Copilot February 18, 2025 22:09
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

src/Microsoft.ML.Tokenizers/Normalizer/SentencePieceNormalizer.cs:392

  • In this loop, calling sp.Slice(1) does not update the variable 'sp', which may result in an infinite loop if the condition remains true. Consider reassigning the sliced span to 'sp' (e.g., sp = sp.Slice(1)).
while (isPrevSpace && sp.Length > 0 && sp[0] == (byte)' ') { sp.Slice(1); }

Copy link
Contributor

@luisquintanilla luisquintanilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo - SentencePieceUingramModel.cs should be SentencePieceUnigramModel.cs

Copy link
Member

@ericstj ericstj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing feedback. This looks good to me. Please check if @michaelgsharp has feedback too.

@tarekgh tarekgh merged commit 2bd88b9 into dotnet:main Feb 25, 2025
25 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 27, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Sentencepiece Unigram tokenizer
5 participants