Discrepancies between encode and decode in SimpleTokenizer #305

Closed
@Quasimondo

I came across a behavior that I find confusing, and I am not sure whether it is expected or points to an issue with SimpleTokenizer. Decoding certain token sequences and then re-encoding the resulting text can produce a different token sequence, even though both sequences decode to identical text.

Example:

from clip.simple_tokenizer import SimpleTokenizer

st = SimpleTokenizer()

tokens1 = [6808, 31978, 23644, 32441, 41003, 16125]
text_out1 = st.decode(tokens1)

tokens2 = st.encode(text_out1)
text_out2 = st.decode(tokens2)

print(text_out1)
print(text_out2)
print(text_out1==text_out2)

print(tokens2)

In this example the input tokens were
[6808, 31978, 23644, 32441, 41003, 16125]
but the output tokens are
[6808, 31978, 950, 544, 36461, 41003, 16125]

The decoded text is identical in both cases: "critical murraycaneleroy brandi xiv "
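
The round trip above can be wrapped in a small helper so that other token sequences can be checked the same way. This is only a sketch: `roundtrip_report` is a hypothetical name, and it assumes nothing beyond a tokenizer object exposing `encode(text)` and `decode(tokens)`, as `SimpleTokenizer` does in the example.

```python
def roundtrip_report(tokenizer, tokens):
    """Decode, re-encode and decode again, then compare both stages.

    `tokenizer` is assumed to offer encode(text) -> list of ints and
    decode(tokens) -> str; this helper is illustrative, not part of CLIP.
    """
    text1 = tokenizer.decode(tokens)
    tokens2 = tokenizer.encode(text1)
    text2 = tokenizer.decode(tokens2)
    return {
        "tokens_equal": list(tokens) == list(tokens2),  # do the ids survive?
        "text_equal": text1 == text2,                   # does the text survive?
        "tokens_out": list(tokens2),
        "text": text1,
    }
```

Going by the output reported above, `roundtrip_report(st, tokens1)` would be expected to give `tokens_equal` as False and `text_equal` as True for the sequence in question.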

Looking into the code, I suspected that the text cleanup functions were causing this, but replacing both of them with pass-through versions yields the same result:

import clip.simple_tokenizer

# Pass-through replacements that log their input instead of cleaning it.
def basic_clean(text):
    print("basic_clean called:", text)
    return text

def whitespace_clean(text):
    print("whitespace_clean called:", text)
    return text

# Patch the module-level helpers before the tokenizer is used.
clip.simple_tokenizer.basic_clean = basic_clean
clip.simple_tokenizer.whitespace_clean = whitespace_clean

st = clip.simple_tokenizer.SimpleTokenizer()

tokens1 = [6808, 31978, 23644, 32441, 41003, 16125]
text_out1 = st.decode(tokens1)

tokens2 = st.encode(text_out1)
text_out2 = st.decode(tokens2)

print(tokens1)
print(tokens2)

print(">"+text_out1+"<")
print(">"+text_out2+"<")
print(text_out1==text_out2)
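
One possible explanation, not confirmed in this report, is that BPE encoding is deterministic but a given string can usually be spelled by more than one sequence of vocabulary entries. `encode` applies merges in priority order and returns one canonical segmentation, while `decode` accepts any sequence and simply concatenates the pieces. The toy tokenizer below illustrates the idea under stated assumptions: its three-character vocabulary and two merge rules are hypothetical, it is not CLIP's vocabulary, and it leaves out the end-of-word marker and caching that a real implementation uses.

```python
# Hypothetical merge rules, listed from highest to lowest priority.
MERGES = [("a", "b"), ("b", "c")]
RANKS = {pair: rank for rank, pair in enumerate(MERGES)}

# Vocabulary: single characters first, then one entry per merge.
VOCAB = ["a", "b", "c"] + [left + right for left, right in MERGES]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def encode(text):
    """Greedy BPE: repeatedly apply the best-ranked merge that is present."""
    symbols = list(text)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: RANKS.get(p, float("inf")))
        if best not in RANKS:
            break  # no applicable merge is left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return [TOKEN_TO_ID[s] for s in symbols]


def decode(tokens):
    """Concatenate the vocabulary strings; any id sequence is accepted."""
    return "".join(VOCAB[t] for t in tokens)


non_canonical = [TOKEN_TO_ID["a"], TOKEN_TO_ID["bc"]]  # "a" + "bc"
text = decode(non_canonical)
canonical = encode(text)                               # "ab" + "c"

print(text)                        # abc
print(non_canonical, canonical)    # [0, 4] [3, 2]
print(decode(canonical) == text)   # True
```

In this toy setup both id sequences decode to "abc", but only `[3, 2]` is ever produced by `encode`, because the `("a", "b")` merge outranks `("b", "c")`. If the same mechanism is at work in the example above, the original token list would simply be a non-canonical spelling of the same text.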
