Unigram tokenizer fixes #7409
Conversation
PR Overview
This PR fixes and refines the behavior of the unigram tokenizer, including enhanced test coverage for control characters and Unicode decomposition, improvements to the vocabulary mapping in SentencePieceUnigramModel, and adjustments to the normalization logic and token ID assignments.
- Added additional unit tests for control character handling and Unicode decomposition.
- Updated vocabulary mapping in SentencePieceUnigramModel with debug assertions and conditional pad handling.
- Modified normalization checks in SentencePieceNormalizer and simplified token ID assignments in SentencePieceBaseModel.
Reviewed Changes
| File | Description |
|---|---|
| test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs | Added new test cases for control characters and Unicode decomposition |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs | Updated vocabulary initialization to use TrainerSpec values with added debug assertions |
| src/Microsoft.ML.Tokenizers/Normalizer/SentencePieceNormalizer.cs | Replaced default Memory checks with normalizedPrefix.Length comparisons for clarity |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs | Simplified token ID assignment by directly using TrainerSpec values, removing default fallbacks |
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (2)
src/Microsoft.ML.Tokenizers/Normalizer/SentencePieceNormalizer.cs:358
- Verify that checking normalizedPrefix.Length == 0 correctly distinguishes between a default Memory and an intentionally empty normalized prefix. If an empty normalized prefix is a valid outcome, consider a more explicit condition to avoid ambiguity.
ReadOnlySpan<byte> normalizedByte = normalizedPrefix.Length == 0 ? input.Slice(0, p) : normalizedPrefix.Span;
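As an illustration of the ambiguity (a standalone sketch, not code from this PR): a Length check alone cannot tell a default ReadOnlyMemory<byte> apart from a deliberately empty slice, while comparing against default can.

```csharp
using System;

class MemoryCheckDemo
{
    static void Main()
    {
        ReadOnlyMemory<byte> unset = default;                          // never assigned a buffer
        ReadOnlyMemory<byte> emptyPrefix = new byte[4].AsMemory(0, 0); // deliberately empty slice

        // Both report Length == 0, so a `Length == 0` check treats them the same.
        Console.WriteLine(unset.Length == 0);        // True
        Console.WriteLine(emptyPrefix.Length == 0);  // True

        // Comparing against `default` only matches the unset value, because
        // Equals checks that both refer to the same backing memory region.
        Console.WriteLine(unset.Equals(default(ReadOnlyMemory<byte>)));       // True
        Console.WriteLine(emptyPrefix.Equals(default(ReadOnlyMemory<byte>))); // False
    }
}
```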
src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs:28
- Removing the fallback for BOS (and similarly for EOS and UNK) IDs may lead to negative or zero token IDs if TrainerSpec values are not strictly positive. Consider adding validation (e.g., Debug.Assert) to ensure these IDs are valid.
BeginningOfSentenceId = modelProto.TrainerSpec.BosId;
Debug.Assert(modelProto.TrainerSpec.BosId >= 0);
Debug.Assert(modelProto.TrainerSpec.EosId >= 0);

_vocab[modelProto.TrainerSpec.UnkPiece] = modelProto.TrainerSpec.UnkId;
Was this present in the original tokenizer, or is it special to our port? Asking because I don't understand why this isn't handled via modelProto.Pieces. Are we certain we only need these 3 special cases? A comment might help.
This is special to our port: we add these special tokens to the vocab for easier lookup. Adding them here doesn't change any behavior beyond allowing the vocabulary to map these tokens, which helps with some operations, like decoding for example.
I'll add a detailed comment here. Thanks!
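For readers of this thread, here is a minimal sketch of the pattern being described (simplified, hypothetical names, not the actual SentencePieceUnigramModel code): the vocabulary is filled from modelProto.Pieces, and the special pieces are then mapped explicitly, with the pad piece added only when a pad id is configured.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical, simplified stand-in for the proto type used in the real model.
record TrainerSpec(string UnkPiece, int UnkId, string BosPiece, int BosId,
                   string EosPiece, int EosId, string PadPiece, int PadId);

class UnigramVocabSketch
{
    private readonly Dictionary<string, int> _vocab = new();

    public UnigramVocabSketch(IReadOnlyList<(string Piece, float Score)> pieces, TrainerSpec spec)
    {
        // Regular pieces populate the vocab (and, in the real model, the matching trie).
        for (int id = 0; id < pieces.Count; id++)
        {
            _vocab[pieces[id].Piece] = id;
        }

        // Special tokens are mapped here only so lookups (e.g. during decoding) work;
        // they are intentionally not inserted into the matching trie.
        Debug.Assert(spec.UnkId >= 0 && spec.BosId >= 0 && spec.EosId >= 0);
        _vocab[spec.UnkPiece] = spec.UnkId;
        _vocab[spec.BosPiece] = spec.BosId;
        _vocab[spec.EosPiece] = spec.EosId;

        // Pad is optional: a negative id means no pad token is configured.
        if (spec.PadId >= 0)
        {
            _vocab[spec.PadPiece] = spec.PadId;
        }
    }
}
```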
Can you know for sure that modelProto.Pieces.Count, used when allocating _vocabReverse, is greater than these IDs? I guess so, since these IDs come from the modelProto too. It just seems odd to me that we step through all of modelProto.Pieces above, but then come back and overwrite some of the IDs like this.
> It just seems odd to me that we step through all of modelProto.Pieces above, but then come back and overwrite some of the IDs like this.

This is done this way to avoid adding these special tokens to the trie, as they shouldn't be part of it. The native code also doesn't add such tokens when enumerating modelProto.Pieces. Adding them to the vocab after the trie is built is purely our addition, for easier mapping internally.

> Can you know for sure that modelProto.Pieces.Count, used when allocating _vocabReverse, is greater than these IDs?

I can check, but what do you suggest doing when the data is wrong for any reason? Just throw an exception?
An explicit exception might be better than an index out of range, but your call.
This has made me wonder twice - in the initial PR and here - so this logic warrants a comment in the source explaining what's going on.
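As a sketch of what that explicit validation could look like (a hypothetical helper, not the code added in this PR): validate each special id against the number of pieces before writing the vocab and _vocabReverse entries, so a malformed model file fails with a clear message rather than an IndexOutOfRangeException later.

```csharp
using System;

static class SpecialIdValidation
{
    // Guards the reverse-lookup array against special ids that fall outside
    // the range covered by modelProto.Pieces.
    public static void EnsureSpecialIdInRange(int specialId, int piecesCount, string name)
    {
        if (specialId < 0 || specialId >= piecesCount)
        {
            throw new InvalidOperationException(
                $"The model's {name} id ({specialId}) is outside the range of defined pieces (0..{piecesCount - 1}).");
        }
    }
}
```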
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@           Coverage Diff           @@
##             main    #7409   +/-   ##
=======================================
  Coverage   68.97%   68.97%
=======================================
  Files        1481     1481
  Lines      273666   273696    +30
  Branches    28287    28285     -2
=======================================
+ Hits       188760   188789    +29
- Misses      77511    77517     +6
+ Partials     7395     7390     -5
Flags with carried forward coverage won't be shown.
/ba-g unrelated failures and looks like infrastructure