Unigram tokenizer fixes #7409
Conversation
PR Overview
This PR fixes and refines the behavior of the unigram tokenizer, including enhanced test coverage for control characters and Unicode decomposition, improvements to the vocabulary mapping in SentencePieceUnigramModel, and adjustments to the normalization logic and token ID assignments.
- Added additional unit tests for control character handling and Unicode decomposition.
- Updated vocabulary mapping in SentencePieceUnigramModel with debug assertions and conditional pad handling.
- Modified normalization checks in SentencePieceNormalizer and simplified token ID assignments in SentencePieceBaseModel.
Reviewed Changes
| File | Description |
|---|---|
| test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs | Added new test cases for control characters and Unicode decomposition |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs | Updated vocabulary initialization to use TrainerSpec values with added debug assertions |
| src/Microsoft.ML.Tokenizers/Normalizer/SentencePieceNormalizer.cs | Replaced default Memory checks with normalizedPrefix.Length comparisons for clarity |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs | Simplified token ID assignment by directly using TrainerSpec values, removing default fallbacks |
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (2)
src/Microsoft.ML.Tokenizers/Normalizer/SentencePieceNormalizer.cs:358
- Verify that checking normalizedPrefix.Length == 0 correctly distinguishes between a default Memory and an intentionally empty normalized prefix. If an empty normalized prefix is a valid outcome, consider a more explicit condition to avoid ambiguity.
ReadOnlySpan<byte> normalizedByte = normalizedPrefix.Length == 0 ? input.Slice(0, p) : normalizedPrefix.Span;
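As an illustration of the ambiguity (a standalone sketch, not code from this PR): a Length check alone cannot tell a default ReadOnlyMemory<byte> apart from a deliberately empty slice, while comparing against default can.

```csharp
using System;

class MemoryCheckDemo
{
    static void Main()
    {
        ReadOnlyMemory<byte> unset = default;                          // never assigned a buffer
        ReadOnlyMemory<byte> emptyPrefix = new byte[4].AsMemory(0, 0); // deliberately empty slice

        // Both report Length == 0, so a `Length == 0` check treats them the same.
        Console.WriteLine(unset.Length == 0);        // True
        Console.WriteLine(emptyPrefix.Length == 0);  // True

        // Comparing against `default` only matches the unset value, because
        // Equals checks that both refer to the same backing memory region.
        Console.WriteLine(unset.Equals(default(ReadOnlyMemory<byte>)));       // True
        Console.WriteLine(emptyPrefix.Equals(default(ReadOnlyMemory<byte>))); // False
    }
}
```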
src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs:28
- Removing the fallback for BOS (and similarly for EOS and UNK) IDs may lead to negative or zero token IDs if TrainerSpec values are not strictly positive. Consider adding validation (e.g., Debug.Assert) to ensure these IDs are valid.
BeginningOfSentenceId = modelProto.TrainerSpec.BosId;
Debug.Assert(modelProto.TrainerSpec.BosId >= 0);
Debug.Assert(modelProto.TrainerSpec.EosId >= 0);

_vocab[modelProto.TrainerSpec.UnkPiece] = modelProto.TrainerSpec.UnkId;
Was this present in the original tokenizer, or is it special to our port? Asking because I don't understand why this isn't handled via modelProto.Pieces. Are we certain we only need these 3 special cases? A comment might help.
This is special to our port: we add these special tokens to the vocab for easier lookup. Adding them here doesn't change any behavior beyond allowing the vocabulary to map these tokens, which helps with some operations, like decoding for example.
I'll add a detailed comment here. Thanks!
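For readers of this thread, here is a minimal sketch of the pattern being described (simplified, hypothetical names, not the actual SentencePieceUnigramModel code): the vocabulary is filled from modelProto.Pieces, and the special pieces are then mapped explicitly, with the pad piece added only when a pad id is configured.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical, simplified stand-in for the proto type used in the real model.
record TrainerSpec(string UnkPiece, int UnkId, string BosPiece, int BosId,
                   string EosPiece, int EosId, string PadPiece, int PadId);

class UnigramVocabSketch
{
    private readonly Dictionary<string, int> _vocab = new();

    public UnigramVocabSketch(IReadOnlyList<(string Piece, float Score)> pieces, TrainerSpec spec)
    {
        // Regular pieces populate the vocab (and, in the real model, the matching trie).
        for (int id = 0; id < pieces.Count; id++)
        {
            _vocab[pieces[id].Piece] = id;
        }

        // Special tokens are mapped here only so lookups (e.g. during decoding) work;
        // they are intentionally not inserted into the matching trie.
        Debug.Assert(spec.UnkId >= 0 && spec.BosId >= 0 && spec.EosId >= 0);
        _vocab[spec.UnkPiece] = spec.UnkId;
        _vocab[spec.BosPiece] = spec.BosId;
        _vocab[spec.EosPiece] = spec.EosId;

        // Pad is optional: a negative id means no pad token is configured.
        if (spec.PadId >= 0)
        {
            _vocab[spec.PadPiece] = spec.PadId;
        }
    }
}
```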
Can you know for sure that modelProto.Pieces.Count, used when allocating _vocabReverse, is greater than these IDs? I guess so, since these IDs come from the modelProto too. It just seems odd to me that we step through all of modelProto.Pieces above, but then come back and overwrite some of the IDs like this.
> It just seems odd to me that we step through all of modelProto.Pieces above, but then come back and overwrite some of the IDs like this.

This is done this way to avoid adding these special tokens to the trie, as they shouldn't be part of it. The native code also doesn't add such tokens when enumerating modelProto.Pieces. Adding them to the vocab after the trie is built is purely our addition, for easier mapping internally.

> Can you know for sure that modelProto.Pieces.Count, used when allocating _vocabReverse, is greater than these IDs?

I can check, but what do you suggest doing when the data is wrong for any reason? Just throw an exception?
An explicit exception might be better than an index out of range, but your call.
This has made me wonder twice - in the initial PR and here - so this logic warrants a comment in the source explaining what's going on.
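As a sketch of what that explicit validation could look like (a hypothetical helper, not the code added in this PR): validate each special id against the number of pieces before writing the vocab and _vocabReverse entries, so a malformed model file fails with a clear message rather than an IndexOutOfRangeException later.

```csharp
using System;

static class SpecialIdValidation
{
    // Guards the reverse-lookup array against special ids that fall outside
    // the range covered by modelProto.Pieces.
    public static void EnsureSpecialIdInRange(int specialId, int piecesCount, string name)
    {
        if (specialId < 0 || specialId >= piecesCount)
        {
            throw new InvalidOperationException(
                $"The model's {name} id ({specialId}) is outside the range of defined pieces (0..{piecesCount - 1}).");
        }
    }
}
```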
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@           Coverage Diff           @@
##             main    #7409   +/-   ##
=======================================
  Coverage   68.97%   68.97%
=======================================
  Files        1481     1481
  Lines      273666   273696    +30
  Branches    28287    28285     -2
=======================================
+ Hits       188760   188789    +29
- Misses      77511    77517     +6
+ Partials     7395     7390     -5
Flags with carried forward coverage won't be shown.
/ba-g unrelated failures and looks like infrastructure