Address the feedback on the tokenizer's library #7024

Conversation
…rt the cases in the Model abstraction
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #7024 +/- ##
==========================================
- Coverage 68.83% 68.79% -0.04%
==========================================
Files 1258 1254 -4
Lines 250672 250204 -468
Branches 25615 25529 -86
==========================================
- Hits 172547 172125 -422
+ Misses 71493 71468 -25
+ Partials 6632 6611 -21
Flags with carried forward coverage won't be shown.
{
-    var mergeRanks = new Dictionary<(string, string), int>();
+    var mergeRanks = new Cache<(string, string), int>(60_000);
Where does this 60k come from?
The data loaded from the merges file is about 50K entries; I gave it 10K more headroom to grow.
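For context, here is a minimal sketch (not the library's code; the merges file name and the exact counts are assumptions) of where a figure like 60,000 comes from: roughly the number of merge rules in a GPT-2-style merges file (~50K) plus some room to grow.

```csharp
using System;
using System.IO;
using System.Linq;

class MergeCacheSizing
{
    static void Main()
    {
        // Hypothetical path to a GPT-2-style merges file (one merge rule per line).
        string mergesFile = "merges.txt";

        // Count the merge rules, skipping blank lines and the "#version" header line.
        int mergeRuleCount = File.ReadLines(mergesFile)
            .Count(line => !string.IsNullOrWhiteSpace(line) && !line.StartsWith("#"));

        // Cache capacity: all loaded merge ranks plus ~10K of headroom.
        int cacheCapacity = mergeRuleCount + 10_000;

        Console.WriteLine($"merge rules: {mergeRuleCount}, cache capacity: {cacheCapacity}");
    }
}
```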
@@ -77,9 +77,10 @@ public sealed class Bpe : Model
    /// <param name="unknownToken"> The unknown token to be used by the model.</param>
    /// <param name="continuingSubwordPrefix">The prefix to attach to sub-word units that don’t represent a beginning of word.</param>
    /// <param name="endOfWordSuffix">The suffix to attach to sub-word units that represent an end of word.</param>
    public Bpe(string vocabFile, string? mergesFile, string? unknownToken = null, string? continuingSubwordPrefix = null, string? endOfWordSuffix = null) :
    /// <param name="fuseUnknownTokens">Indicate whether allowing multiple unknown tokens get fused.</param>
I'm having trouble understanding what this means.
When encoding text with the Bpe model, any token the model doesn't recognize is replaced with the unknown token. Most users use [Unk] as the unknown token, so it is possible to get multiple [Unk] tokens next to each other in the result. Setting fuseUnknownTokens to true causes any such [Unk] sequence to collapse into a single [Unk]. The term "fuse" is used by Hugging Face, and users of Bpe are familiar with it. If you have a better explanation we can use here, I'll be happy to use it :-)
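To make the behavior concrete, here is a small sketch of the fusing idea applied to a token sequence. The helper and the token list are hypothetical illustrations of the semantics, not the library's implementation: consecutive [Unk] entries collapse into one.

```csharp
using System;
using System.Collections.Generic;

class FuseUnknownDemo
{
    // Hypothetical helper illustrating the "fuse" semantics:
    // consecutive occurrences of the unknown token collapse into a single one.
    static List<string> FuseUnknown(IEnumerable<string> tokens, string unknownToken = "[Unk]")
    {
        var result = new List<string>();
        foreach (string token in tokens)
        {
            // Skip the token only when it repeats the unknown token we just emitted.
            if (token == unknownToken && result.Count > 0 && result[^1] == unknownToken)
                continue;
            result.Add(token);
        }
        return result;
    }

    static void Main()
    {
        var tokens = new[] { "hel", "lo", "[Unk]", "[Unk]", "[Unk]", "world" };
        // Without fusing: hel lo [Unk] [Unk] [Unk] world
        // With fusing:    hel lo [Unk] world
        Console.WriteLine(string.Join(" ", FuseUnknown(tokens)));
    }
}
```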
This fix addresses the feedback reported in the issues: