I just came across a phenomenon that I find a bit confusing; I'm not sure whether this is expected behavior or points to an issue with SimpleTokenizer. Taking certain sequences of tokens, decoding them, and then encoding the returned text again can result in a different sequence of tokens, even though the two decoded texts are identical.
Example:
from clip.simple_tokenizer import SimpleTokenizer
st = SimpleTokenizer()
tokens1 = [6808, 31978, 23644, 32441, 41003, 16125]
text_out1 = st.decode(tokens1)
tokens2 = st.encode(text_out1)
text_out2 = st.decode(tokens2)
print(text_out1)
print(text_out2)
print(text_out1==text_out2)
print(tokens2)
In this example the input tokens were
[6808, 31978, 23644, 32441, 41003, 16125]
but the output tokens are
[6808, 31978, 950, 544, 36461, 41003, 16125]
The texts in both cases are identical, namely "critical murraycaneleroy brandi xiv ".
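To see where the two sequences diverge, it can help to decode the tokens one at a time. This is a small sketch of my own (using the same st instance as above), not output from the run described here:

# Decode each token individually to compare the two BPE segmentations.
# Since the leading and trailing tokens agree, 23644 and 32441 in the input
# should cover the same substring as 950, 544 and 36461 in the re-encoded output.
for seq in ([6808, 31978, 23644, 32441, 41003, 16125],
            [6808, 31978, 950, 544, 36461, 41003, 16125]):
    print([st.decode([t]) for t in seq])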
Looking into the code, one possibility was that the text cleanup functions (basic_clean and whitespace_clean) are causing this, but overriding both of them yields the same result:
import clip.simple_tokenizer

def basic_clean(text):
    print("basic_clean called:", text)
    return text

def whitespace_clean(text):
    print("whitespace_clean called:", text)
    return text

# Replace the module-level cleanup functions with pass-through versions
# so they no longer modify the text before encoding.
clip.simple_tokenizer.basic_clean = basic_clean
clip.simple_tokenizer.whitespace_clean = whitespace_clean

st = clip.simple_tokenizer.SimpleTokenizer()
tokens1 = [6808, 31978, 23644, 32441, 41003, 16125]
text_out1 = st.decode(tokens1)
tokens2 = st.encode(text_out1)
text_out2 = st.decode(tokens2)
print(tokens1)
print(tokens2)
print(">" + text_out1 + "<")
print(">" + text_out2 + "<")
print(text_out1 == text_out2)
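My current guess is that the cleanup functions are not involved at all and that this is inherent to BPE: encode always returns one canonical (greedy) segmentation, while tokens1 is an alternative segmentation that happens to decode to the same text. A minimal check of that idea (my own addition, same st instance as above, not part of the original run) would be to verify that the re-encoded sequence is stable under a further round trip:

# If tokens2 is the tokenizer's canonical segmentation, another decode/encode
# cycle should leave it unchanged.
tokens3 = st.encode(st.decode(tokens2))
print(tokens3)
print(tokens2 == tokens3)  # expected True: the sequence stabilizes after one round trip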