I just came across a phenomenon which I find a bit confusing - not sure if this is expected behavior or an issue with SimpleTokenizer. Decoding certain sequences of tokens and then re-encoding the resulting text can produce a different sequence of tokens, even though the two texts are identical.
Okay, after looking into this further I realize that this is simply a fact of life - there are multiple ways the same word can be encoded into tokens, e.g.
The word "murraycaneleroy" can either be tokenized as
murray | cane | leroy
or as
murray | can | el | eroy
And there is no way for the decoder to know which token combination created that word.
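The ambiguity can be illustrated with a toy stand-in (not CLIP's actual SimpleTokenizer): a deterministic greedy longest-match tokenizer over a small hypothetical vocabulary. Two different token sequences decode to the same string, but the encoder, being deterministic, can only ever return one of them. (BPE's merge-rank logic may pick a different split than greedy matching does; the point is just that one of the splits is unrecoverable.)

```python
# Hypothetical vocabulary chosen so "murraycaneleroy" has two valid splits.
VOCAB = ["murray", "cane", "can", "el", "eroy", "leroy"]

def encode(text):
    """Greedy longest-match stand-in for a real subword encoder."""
    tokens, i = [], 0
    while i < len(text):
        match = max((v for v in VOCAB if text.startswith(v, i)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"cannot tokenize at position {i}")
        tokens.append(match)
        i += len(match)
    return tokens

def decode(tokens):
    return "".join(tokens)

# Two different token sequences decode to the same string...
a = ["murray", "cane", "leroy"]
b = ["murray", "can", "el", "eroy"]
assert decode(a) == decode(b) == "murraycaneleroy"

# ...but re-encoding the decoded text deterministically picks one split.
print(encode("murraycaneleroy"))  # → ['murray', 'cane', 'leroy']
```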
Example:
In this example the input tokens were
[6808, 31978, 23644, 32441, 41003, 16125]
but the output tokens are
[6808, 31978, 950, 544, 36461, 41003, 16125]
In both cases the decoded text is identical: "critical murraycaneleroy brandi xiv "
Looking into the code, one possibility was that the text cleanup methods are causing this, but overriding the two methods yields the same results.
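Since the cleanup methods are ruled out, the useful property to check is that the decoded *text* is a fixed point: encode(decode(t)) may differ from t, but one extra round trip always stabilizes. A minimal sketch, using hypothetical token ids and the same toy greedy encoder rather than CLIP's real vocabulary:

```python
# Hypothetical id mapping for the ambiguous example word.
VOCAB = {"murray": 0, "cane": 1, "can": 2, "el": 3, "eroy": 4, "leroy": 5}
INV = {i: w for w, i in VOCAB.items()}

def encode(text):
    # Deterministic greedy longest-match stand-in for BPE.
    ids, i = [], 0
    while i < len(text):
        w = max((v for v in VOCAB if text.startswith(v, i)), key=len)
        ids.append(VOCAB[w])
        i += len(w)
    return ids

def decode(ids):
    return "".join(INV[i] for i in ids)

t = [0, 2, 3, 4]          # murray | can | el | eroy
t2 = encode(decode(t))    # re-encoding picks a different split
assert t2 != t
assert decode(t2) == decode(t)   # the text itself is unchanged
assert encode(decode(t2)) == t2  # second round trip is stable
```

So a round-trip sanity check on this tokenizer should compare decoded texts, not raw token ids.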