convert: Swap GLM4 EOS / EOT token #13505
Open
+2
−2
When testing GLM4 models I noticed that they leak <|endoftext|>. This happens because the converter sets the EOS token to <|endoftext|> and the EOT token to <|user|>. The Hugging Face config.json defines the EOS as <|user|>, which overrides the converter's EOS, so <|user|> ends up being used for both and <|endoftext|> is not treated as an end-of-text token.
This PR simply reverses the two definitions. Fine-tunes can still override the EOS should they need to, while <|endoftext|> is treated as the EOT token.
In my test conversion this gives the following result:
INFO:gguf.vocab:Setting special token type eos to 151336
INFO:gguf.vocab:Setting special token type pad to 151329
INFO:gguf.vocab:Setting special token type eot to 151329
INFO:gguf.vocab:Setting special token type unk to 151329
INFO:gguf.vocab:Setting special token type bos to 151329
Before this PR the result is:
INFO:gguf.vocab:Setting special token type eos to 151336
INFO:gguf.vocab:Setting special token type pad to 151329
INFO:gguf.vocab:Setting special token type eot to 151336
INFO:gguf.vocab:Setting special token type unk to 151329
INFO:gguf.vocab:Setting special token type bos to 151329
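To make the difference between the two logs concrete, here is a minimal sketch of the override behaviour described above. It is not the converter's code: `resolve_special_tokens` and `stop_tokens` are hypothetical helpers, and it assumes that a value from config.json replaces the converter's definition and that the runtime stops on either the EOS or the EOT token. The token IDs are the ones from the logs.

```python
# Token IDs as they appear in the conversion logs above.
ENDOFTEXT = 151329  # <|endoftext|>
USER = 151336       # <|user|>


def resolve_special_tokens(converter_defaults, config_overrides):
    """Hypothetical helper: values from config.json win over converter defaults."""
    resolved = dict(converter_defaults)
    resolved.update(config_overrides)
    return resolved


def stop_tokens(resolved):
    """Assumption: generation stops on either the EOS or the EOT token."""
    return {resolved["eos"], resolved["eot"]}


# config.json defines the EOS as <|user|>.
config_overrides = {"eos": USER}

# Before this PR: the converter's EOS is overridden, so both slots hold <|user|>.
before = resolve_special_tokens({"eos": ENDOFTEXT, "eot": USER}, config_overrides)

# After this PR: the override matches the converter's EOS, and EOT keeps <|endoftext|>.
after = resolve_special_tokens({"eos": USER, "eot": ENDOFTEXT}, config_overrides)

print("before:", before, "stops on", sorted(stop_tokens(before)))
print("after: ", after, "stops on", sorted(stop_tokens(after)))
```

Under these assumptions the old definitions leave <|endoftext|> out of the stop set entirely, while the swapped definitions cover both tokens even when a fine-tune overrides the EOS.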
If you prefer to solve this in a different manner (such as forcing the EOS to be set according to the converter's internal definition), feel free to reject this PR and we can open an issue instead.