Introducing SentencePiece Unigram Tokenizer Model #7390

tarekgh · 2025-02-18T20:10:47Z

We have been supporting SentencePiece Bpe model for a while, the change here is introducing the support to the SentencePiece Unigram tokenizer model.

Users can create a tokenizer instance by using a code like the following:

// download the tokenizer.model from the LLM model site like the following:            
// @"https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/resolve/main/sentencepiece.bpe.model?download=true";
using Stream remoteStream = File.OpenRead(path/to/the/tokenizer/model/file);
SentencePieceTokenizer tokenizer = SentencePieceTokenizer.Create(remoteStream);

Notes around the change

Have done some refactoring code existed in the SentencePieceToknizer.cs file into a newly introduced file SentencePieceBaseModel.cs and SentencePieceBpeModel.cs. Most of the code in the newly introduced files are mostly not changed.
Added the new Unigram implementation into the SentencePieceUnigramModel.cs file.
Changed SentencePieceTokenizer.cs to work with the model abstraction and automatically handle both models Bpe and Unigram.
The change shouldn't introduce any behavior change to the existing model (i.e Bpe).

tarekgh · 2025-02-18T20:11:58Z

CC @ericstj @luisquintanilla @LittleLittleCloud @michaelgsharp @stephentoub

Copilot

Copilot reviewed 11 out of 14 changed files in this pull request and generated 1 comment.

Files not reviewed (3)

THIRD-PARTY-NOTICES.TXT: Language not supported
eng/Versions.props: Language not supported
src/Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs: Evaluated as low risk

src/Microsoft.ML.Tokenizers/Utils/OrdinalUtf8StringComparer.cs

codecov · 2025-02-18T21:59:33Z

Codecov Report

Attention: Patch coverage is 75.07347% with 933 lines in your changes missing coverage. Please review.

Project coverage is 68.94%. Comparing base (e3219a9) to head (d244042).
Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
...t.ML.Tokenizers/Model/SentencePieceUnigramModel.cs	60.18%	293 Missing and 47 partials ⚠️
...osoft.ML.Tokenizers/Model/SentencePieceBpeModel.cs	77.43%	158 Missing and 32 partials ⚠️
...c/Microsoft.ML.Tokenizers/Utils/DoubleArrayTrie.cs	82.66%	107 Missing and 19 partials ⚠️
...soft.ML.Tokenizers/Model/SentencePieceBaseModel.cs	79.04%	73 Missing and 32 partials ⚠️
...L.Tokenizers/Normalizer/SentencePieceNormalizer.cs	56.77%	61 Missing and 22 partials ⚠️
src/Microsoft.ML.Tokenizers/Utils/Helpers.cs	37.77%	56 Missing ⚠️
...t.ML.Tokenizers/Utils/OrdinalUtf8StringComparer.cs	59.61%	16 Missing and 5 partials ⚠️
test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs	97.10%	6 Missing and 6 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7390      +/-   ##
==========================================
+ Coverage   68.88%   68.94%   +0.06%     
==========================================
  Files        1473     1480       +7     
  Lines      270814   273273    +2459     
  Branches    27884    28234     +350     
==========================================
+ Hits       186553   188419    +1866     
- Misses      76985    77490     +505     
- Partials     7276     7364      +88

Flag	Coverage Δ
Debug	`68.94% <75.07%> (+0.06%)`	⬆️
production	`63.26% <71.73%> (+0.06%)`	⬆️
test	`89.46% <97.52%> (+0.06%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
.../Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs	`72.76% <ø> (ø)`
...rc/Microsoft.ML.Tokenizers/Model/LlamaTokenizer.cs	`43.75% <ø> (-15.35%)`	⬇️
...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs	`90.00% <ø> (+13.13%)`	⬆️
...icrosoft.ML.Tokenizers/Utils/Helpers.netcoreapp.cs	`39.47% <100.00%> (+1.63%)`	⬆️
...crosoft.ML.Tokenizers.Tests/DoubleArrayTrieTest.cs	`100.00% <100.00%> (ø)`
test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs	`97.10% <97.10%> (ø)`
...t.ML.Tokenizers/Utils/OrdinalUtf8StringComparer.cs	`59.61% <59.61%> (ø)`
src/Microsoft.ML.Tokenizers/Utils/Helpers.cs	`70.63% <37.77%> (-20.46%)`	⬇️
...L.Tokenizers/Normalizer/SentencePieceNormalizer.cs	`70.32% <56.77%> (-17.95%)`	⬇️
...soft.ML.Tokenizers/Model/SentencePieceBaseModel.cs	`79.04% <79.04%> (ø)`
... and 3 more

... and 11 files with indirect coverage changes

ericstj

Didn't dig too deep just a few small observations and suggestions.

src/Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs

src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs

src/Microsoft.ML.Tokenizers/Utils/ByteArrayComparer.cs

src/Microsoft.ML.Tokenizers/Utils/OrdinalUtf8StringComparer.cs

test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs

src/Microsoft.ML.Tokenizers/Utils/DoubleArrayTrie.cs

Copilot

Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

src/Microsoft.ML.Tokenizers/Normalizer/SentencePieceNormalizer.cs:392

In this loop, calling sp.Slice(1) does not update the variable 'sp', which may result in an infinite loop if the condition remains true. Consider reassigning the sliced span to 'sp' (e.g., sp = sp.Slice(1)).

while (isPrevSpace && sp.Length > 0 && sp[0] == (byte)' ') { sp.Slice(1); }

src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs

src/Microsoft.ML.Tokenizers/Model/SentencePieceBpeModel.cs

luisquintanilla

Typo - SentencePieceUingramModel.cs should be SentencePieceUnigramModel.cs

src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs

src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs

src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs

test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs

src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs

ericstj

Thank you for addressing feedback. This looks good to me. Please check if @michaelgsharp has feedback too.

src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs

Introducing SentencePiece Unigram Tokenizer Model

12af680

Copilot AI review requested due to automatic review settings February 18, 2025 20:10

dotnet-policy-service bot assigned tarekgh Feb 18, 2025

Copilot AI reviewed Feb 18, 2025

View reviewed changes

src/Microsoft.ML.Tokenizers/Utils/OrdinalUtf8StringComparer.cs Outdated Show resolved Hide resolved

Copilot feedback addressing :-)

cca1840

tarekgh added this to the ML.NET 5.0 milestone Feb 18, 2025

ericstj reviewed Feb 18, 2025

View reviewed changes

ericstj requested a review from Copilot February 18, 2025 22:09

Copilot AI reviewed Feb 18, 2025

View reviewed changes

tarekgh added 4 commits February 18, 2025 15:09

addressing Copilot feedback

adc5110

Feedback Addressing 1

070c09a

Clean up

dbdd50a

More clean up

a995226