
Discrepancies between encode and decode in SimpleTokenizer #305


Closed
Quasimondo opened this issue Dec 15, 2022 · 1 comment


Quasimondo commented Dec 15, 2022

I just came across a phenomenon that I find a bit confusing; I'm not sure whether this is expected behavior or an issue with SimpleTokenizer. Decoding certain sequences of tokens and then encoding the returned text again can yield a different sequence of tokens, even though the two texts are identical.

Example:

from clip.simple_tokenizer import SimpleTokenizer

st = SimpleTokenizer()

# decode a token sequence, then re-encode the resulting text
tokens1 = [6808, 31978, 23644, 32441, 41003, 16125]
text_out1 = st.decode(tokens1)

tokens2 = st.encode(text_out1)
text_out2 = st.decode(tokens2)

print(text_out1)
print(text_out2)
print(text_out1 == text_out2)  # True: the texts round-trip

print(tokens2)  # but the token ids differ from tokens1

In this example the input tokens were
[6808, 31978, 23644, 32441, 41003, 16125]
but the output tokens are
[6808, 31978, 950, 544, 36461, 41003, 16125]

The texts in both cases are identical: "critical murraycaneleroy brandi xiv "

Looking into the code, one possibility was that the text cleanup functions are causing this, but overriding both of them with no-ops yields the same result:

import clip.simple_tokenizer

# replace both module-level cleanup functions with no-ops that log their input
def basic_clean(text):
    print("basic_clean called:", text)
    return text

def whitespace_clean(text):
    print("whitespace_clean called:", text)
    return text

clip.simple_tokenizer.basic_clean = basic_clean
clip.simple_tokenizer.whitespace_clean = whitespace_clean

st = clip.simple_tokenizer.SimpleTokenizer()

tokens1 = [6808, 31978, 23644, 32441, 41003, 16125]
text_out1 = st.decode(tokens1)

tokens2 = st.encode(text_out1)
text_out2 = st.decode(tokens2)

print(tokens1)
print(tokens2)

print(">" + text_out1 + "<")
print(">" + text_out2 + "<")
print(text_out1 == text_out2)
Quasimondo (Author) commented

Okay, after looking into this further, I realize that this is simply a fact of life: there are multiple ways the same word can be encoded into tokens, e.g.

The word "murraycaneleroy" can either be tokenized as
murray | cane | leroy
or as
murray | can | el | eroy

And there is no way for the decoder to know which token combination created that word.
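
For anyone who hits this later, here is a minimal sketch of the asymmetry. The subword ids below are inferred from the example above (murray | cane | leroy vs. murray | can | el | eroy), so treat the exact ids and the bpe() output shown in the comments as assumptions rather than verified values:

from clip.simple_tokenizer import SimpleTokenizer

st = SimpleTokenizer()

# two token sequences assumed to spell the same word
# (ids inferred from the example above, not checked against the vocab)
seq_a = [31978, 23644, 32441]     # murray | cane | leroy
seq_b = [31978, 950, 544, 36461]  # murray | can | el | eroy

# decode() just concatenates subwords, so it is many-to-one
print(st.decode(seq_a) == st.decode(seq_b))

# encode() re-applies the greedy BPE merges in a fixed priority order,
# so it always picks exactly one of the possible segmentations
print(st.bpe("murraycaneleroy"))    # e.g. "murray can el eroy</w>"
print(st.encode("murraycaneleroy"))

This is inherent to BPE rather than a bug: decode() loses the segmentation boundaries, and encode() can only ever reproduce the one canonical greedy segmentation.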
