
Discrepancies between encode and decode in SimpleTokenizer #305


Closed
Quasimondo opened this issue Dec 15, 2022 · 1 comment


Quasimondo commented Dec 15, 2022

I just came across a phenomenon that I find a bit confusing; I'm not sure whether this is expected behavior or an issue with SimpleTokenizer. Decoding certain sequences of tokens and then encoding the returned text again can yield a different sequence of tokens, even though the two texts are identical.

Example:

from clip.simple_tokenizer import SimpleTokenizer

st = SimpleTokenizer()

# decode a token sequence, then re-encode the resulting text
tokens1 = [6808, 31978, 23644, 32441, 41003, 16125]
text_out1 = st.decode(tokens1)

tokens2 = st.encode(text_out1)
text_out2 = st.decode(tokens2)

print(text_out1)
print(text_out2)
print(text_out1 == text_out2)  # True: the texts round-trip

print(tokens2)  # but the token ids differ from tokens1

In this example the input tokens were
[6808, 31978, 23644, 32441, 41003, 16125]
but the output tokens are
[6808, 31978, 950, 544, 36461, 41003, 16125]

The texts in both cases are identical: "critical murraycaneleroy brandi xiv "

Looking into the code, one possibility was that the text cleanup functions are causing this, but overriding both of them with no-ops yields the same result:

import clip.simple_tokenizer

# replace both module-level cleanup functions with no-ops that log their input
def basic_clean(text):
    print("basic_clean called:", text)
    return text

def whitespace_clean(text):
    print("whitespace_clean called:", text)
    return text

clip.simple_tokenizer.basic_clean = basic_clean
clip.simple_tokenizer.whitespace_clean = whitespace_clean

st = clip.simple_tokenizer.SimpleTokenizer()

tokens1 = [6808, 31978, 23644, 32441, 41003, 16125]
text_out1 = st.decode(tokens1)

tokens2 = st.encode(text_out1)
text_out2 = st.decode(tokens2)

print(tokens1)
print(tokens2)

print(">" + text_out1 + "<")
print(">" + text_out2 + "<")
print(text_out1 == text_out2)
Quasimondo (Author) commented

Okay, after looking into this further, I realize that this is simply a fact of life: there are multiple ways the same word can be encoded into tokens, e.g.

The word "murraycaneleroy" can either be tokenized as
murray | cane | leroy
or as
murray | can | el | eroy

And there is no way for the decoder to know which token combination created that word.
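
For anyone who hits this later, here is a minimal sketch of the asymmetry. The subword ids below are inferred from the example above (murray | cane | leroy vs. murray | can | el | eroy), so treat the exact ids and the bpe() output shown in the comments as assumptions rather than verified values:

from clip.simple_tokenizer import SimpleTokenizer

st = SimpleTokenizer()

# two token sequences assumed to spell the same word
# (ids inferred from the example above, not checked against the vocab)
seq_a = [31978, 23644, 32441]     # murray | cane | leroy
seq_b = [31978, 950, 544, 36461]  # murray | can | el | eroy

# decode() just concatenates subwords, so it is many-to-one
print(st.decode(seq_a) == st.decode(seq_b))

# encode() re-applies the greedy BPE merges in a fixed priority order,
# so it always picks exactly one of the possible segmentations
print(st.bpe("murraycaneleroy"))    # e.g. "murray can el eroy</w>"
print(st.encode("murraycaneleroy"))

This is inherent to BPE rather than a bug: decode() loses the segmentation boundaries, and encode() can only ever reproduce the one canonical greedy segmentation.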
