Unigram tokenizer fixes #7409

Merged
merged 3 commits into dotnet:main on Mar 5, 2025
Conversation

tarekgh (Member) commented Mar 4, 2025

No description provided.

Copilot AI review requested due to automatic review settings on Mar 4, 2025 20:09

Copilot AI (Contributor) left a comment

PR Overview

This PR fixes and refines the behavior of the unigram tokenizer, including enhanced test coverage for control characters and Unicode decomposition, improvements in model vocabulary mapping in the SentencePieceUnigramModel, and adjustments to the normalization logic and token ID assignments.

  • Added additional unit tests for control character handling and Unicode decomposition (see the sketch after this list)
  • Updated vocabulary mapping in SentencePieceUnigramModel with debug assertions and conditional pad handling.
  • Modified normalization checks in SentencePieceNormalizer and simplified token ID assignments in SentencePieceBaseModel.
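
For illustration, here is a minimal sketch of the decomposition behavior such tests exercise. The example input is hypothetical and not taken from the PR; it relies on the fact that SentencePiece's default normalizer is NFKC-based, so compatibility characters like "①" decompose to their plain forms.

using System.Text;
using Xunit;

public class UnigramDecompositionSketch
{
    [Fact]
    public void UnicodeDecompositionChangesInput()
    {
        string circledOne = "\u2460"; // "①", a compatibility character
        // .NET's built-in NFKC normalization decomposes it to the plain digit.
        Assert.Equal("1", circledOne.Normalize(NormalizationForm.FormKC));
    }
}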

Reviewed Changes

File | Description
test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs | Added new test cases for control characters and Unicode decomposition
src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs | Updated vocabulary initialization to use TrainerSpec values, with added debug assertions
src/Microsoft.ML.Tokenizers/Normalizer/SentencePieceNormalizer.cs | Replaced default Memory checks with normalizedPrefix.Length comparisons for clarity
src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs | Simplified token ID assignment by directly using TrainerSpec values, removing default fallbacks

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

src/Microsoft.ML.Tokenizers/Normalizer/SentencePieceNormalizer.cs:358

  • Verify that checking normalizedPrefix.Length == 0 correctly distinguishes between a default Memory and an intentionally empty normalized prefix. If an empty normalized prefix is a valid outcome, consider a more explicit condition to avoid ambiguity.
ReadOnlySpan<byte> normalizedByte = normalizedPrefix.Length == 0 ? input.Slice(0, p) : normalizedPrefix.Span;
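
One way to make the distinction explicit is to carry the "normalization ran" signal separately from the length. This is a hypothetical sketch, not the PR's code; the hasNormalizedPrefix flag and helper name are illustrative.

using System;

static class NormalizationSketch
{
    // Hypothetical helper: an explicit flag avoids conflating a default
    // Memory<byte> with an intentionally empty normalized prefix.
    internal static ReadOnlySpan<byte> GetNormalizedBytes(
        ReadOnlySpan<byte> input, int p, Memory<byte> normalizedPrefix, bool hasNormalizedPrefix)
    {
        return hasNormalizedPrefix
            ? (ReadOnlySpan<byte>)normalizedPrefix.Span  // normalization ran; an empty result is still valid
            : input.Slice(0, p);                         // normalization did not run; use the raw input
    }
}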

src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs:28

  • Removing the fallback for BOS (and similarly for EOS and UNK) IDs may lead to negative or zero token IDs if TrainerSpec values are not strictly positive. Consider adding validation (e.g., Debug.Assert) to ensure these IDs are valid.
BeginningOfSentenceId = modelProto.TrainerSpec.BosId;

tarekgh (Member, Author) commented Mar 4, 2025

// Validate the special token IDs coming from the model proto.
Debug.Assert(modelProto.TrainerSpec.BosId >= 0);
Debug.Assert(modelProto.TrainerSpec.EosId >= 0);

// Map the UNK piece into the vocab for easier lookup.
_vocab[modelProto.TrainerSpec.UnkPiece] = modelProto.TrainerSpec.UnkId;
Reviewer (Member) commented:

Was this present in the original tokenizer, or is it special to our port? Asking because I don't understand why this isn't handled via modelProto.Pieces. Are we certain we only need these 3 special cases? A comment might help.

tarekgh (Member, Author) replied:

This is special to our port; it ensures these special tokens are added to the vocab for easier lookup. Adding them here doesn't change any behavior beyond allowing the vocabulary to map these tokens, which helps us in some operations, such as decoding.

I'll add a detailed comment here. Thanks!
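
A minimal sketch of the pattern being described, assuming a modelProto like the one referenced in the thread (BosPiece and EosPiece are assumed to mirror UnkPiece in the SentencePiece TrainerSpec; the loop body is illustrative, not the PR's exact code):

// Build the vocab (and the trie, omitted here) from modelProto.Pieces only.
var vocab = new Dictionary<string, int>();
for (int i = 0; i < modelProto.Pieces.Count; i++)
{
    vocab[modelProto.Pieces[i].Piece] = i;
}

// Port-specific addition: after the trie is built, map the special tokens
// into the vocab for easier internal lookup. They never enter the trie.
vocab[modelProto.TrainerSpec.UnkPiece] = modelProto.TrainerSpec.UnkId;
vocab[modelProto.TrainerSpec.BosPiece] = modelProto.TrainerSpec.BosId;
vocab[modelProto.TrainerSpec.EosPiece] = modelProto.TrainerSpec.EosId;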

Reviewer (Member) commented:

Can you know for sure that the modelProto.Pieces.Count used when allocating _vocabReverse is greater than these IDs? I guess so, since these IDs come from the modelProto too. It just seems odd to me that we step through all of modelProto.Pieces above, but then come back and overwrite some of the IDs like this.

tarekgh (Member, Author) replied on Mar 4, 2025:

> Just seems odd to me that we step through all the modelProto.Pieces.Count above, but then we come back and overwrite some of the IDs like this.

This is done this way to avoid adding these special tokens to the trie, as they shouldn't be part of it. The native code likewise doesn't add such tokens when enumerating modelProto.Pieces. Our only addition is that we add these to the vocab after we're done building the trie, for easier mapping internally.

> Can you know for sure that modelProto.Pieces.Count used when allocating _vocabReverse is greater than these IDs?

I can check, but what do you suggest we do if the data is wrong for any reason? Just throw an exception?

Reviewer (Member) replied:

An explicit exception might be better than an IndexOutOfRangeException, but it's your call.

This has made me wonder twice - in the initial PR and here - so this logic warrants a comment in the source explaining what's going on.
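
For instance, a hypothetical validation along these lines (not the PR's code) would fail fast with a descriptive message instead of a later IndexOutOfRangeException when indexing _vocabReverse:

// Hypothetical sketch: surface bad model data explicitly at load time.
if (modelProto.TrainerSpec.UnkId >= modelProto.Pieces.Count)
{
    throw new InvalidOperationException(
        $"UnkId {modelProto.TrainerSpec.UnkId} exceeds the model's piece count ({modelProto.Pieces.Count}).");
}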

codecov bot commented Mar 4, 2025

Codecov Report

Attention: Patch coverage is 83.78378% with 6 lines in your changes missing coverage. Please review.

Project coverage is 68.97%. Comparing base (0807bd8) to head (cccac90).
Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
...t.ML.Tokenizers/Model/SentencePieceUnigramModel.cs | 64.28% | 4 Missing and 1 partial ⚠️
...L.Tokenizers/Normalizer/SentencePieceNormalizer.cs | 50.00% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7409   +/-   ##
=======================================
  Coverage   68.97%   68.97%           
=======================================
  Files        1481     1481           
  Lines      273666   273696   +30     
  Branches    28287    28285    -2     
=======================================
+ Hits       188760   188789   +29     
- Misses      77511    77517    +6     
+ Partials     7395     7390    -5     
Flag | Coverage Δ
Debug | 68.97% <83.78%> (+<0.01%) ⬆️
production | 63.27% <68.42%> (+<0.01%) ⬆️
test | 89.46% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
...soft.ML.Tokenizers/Model/SentencePieceBaseModel.cs | 78.38% <100.00%> (+0.54%) ⬆️
test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs | 94.10% <100.00%> (+0.16%) ⬆️
...L.Tokenizers/Normalizer/SentencePieceNormalizer.cs | 71.51% <50.00%> (ø)
...t.ML.Tokenizers/Model/SentencePieceUnigramModel.cs | 65.89% <64.28%> (-0.11%) ⬇️

... and 9 files with indirect coverage changes

tarekgh (Member, Author) commented Mar 5, 2025

/ba-g unrelated failures; looks like infrastructure

tarekgh merged commit 12ce84a into dotnet:main on Mar 5, 2025 (22 of 25 checks passed).
github-actions bot locked and limited conversation to collaborators on Apr 5, 2025.