-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Introducing SentencePiece Unigram Tokenizer Model #7390
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introducing SentencePiece Unigram Tokenizer Model #7390
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 11 out of 14 changed files in this pull request and generated 1 comment.
Files not reviewed (3)
- THIRD-PARTY-NOTICES.TXT: Language not supported
- eng/Versions.props: Language not supported
- src/Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs: Evaluated as low risk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't dig too deep just a few small observations and suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (1)
src/Microsoft.ML.Tokenizers/Normalizer/SentencePieceNormalizer.cs:392
- In this loop, calling sp.Slice(1) does not update the variable 'sp', which may result in an infinite loop if the condition remains true. Consider reassigning the sliced span to 'sp' (e.g., sp = sp.Slice(1)).
while (isPrevSpace && sp.Length > 0 && sp[0] == (byte)' ') { sp.Slice(1); }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo - SentencePieceUingramModel.cs
should be SentencePieceUnigramModel.cs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for addressing feedback. This looks good to me. Please check if @michaelgsharp has feedback too.
Fixes #7186
We have been supporting SentencePiece Bpe model for a while, the change here is introducing the support to the SentencePiece Unigram tokenizer model.
Users can create a tokenizer instance by using a code like the following:
Notes around the change
SentencePieceToknizer.cs
file into a newly introduced fileSentencePieceBaseModel.cs
andSentencePieceBpeModel.cs
. Most of the code in the newly introduced files are mostly not changed.SentencePieceUnigramModel.cs
file.SentencePieceTokenizer.cs
to work with the model abstraction and automatically handle both models Bpe and Unigram.