Description
In the course Supervised Fine-Tuning, the author uses the base model HuggingFaceTB/SmolLM3-3B-Base, but I chose HuggingFaceTB/SmolLM2-135M because it is lighter. I found that the SmolLM2-135M base model's tokenizer does not have its own chat template, yet it already has special tokens. However, those special tokens may be incorrect; for example, `bos_token` and `eos_token` share the same token, `<|endoftext|>`.
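A quick way to see this (a minimal sketch; it only downloads the tokenizer files):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# Special tokens already exist, even though no chat template is defined.
print(tokenizer.bos_token)       # <|endoftext|>
print(tokenizer.eos_token)       # <|endoftext|> (same as bos)
print(tokenizer.chat_template)   # None
```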
I also referred to the LLM Course chapter Fine-Tuning with SFTTrainer, where the author uses setup_chat_format to create a chat template for a base model's tokenizer that does not have one.
However, setup_chat_format only supports the ChatML format and will be deprecated in trl 0.26.0. That is why I use clone_chat_template instead.
But another issue appears here: while clone_chat_template only overwrites the eos token from the source tokenizer onto the target tokenizer, setup_chat_format overwrites the bos, eos, and pad tokens. After I cloned Llama-3.2-1B-Instruct's chat template, only eos changed, to `<|eot_id|>`:
```python
model, tokenizer, added_tokens = clone_chat_template(
    model=model,
    tokenizer=tokenizer,
    source_tokenizer_path="meta-llama/Llama-3.2-1B-Instruct",
)
```
Questions:
- Why does the base model's tokenizer already have special tokens, even though it has no chat template?
- Since clone_chat_template does not overwrite all special tokens (bos, pad, ...), are there any impacts on SFT training, and what is the solution?
I am new to SFT and would appreciate any support. Thank you.
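For reference, the workaround I am currently considering (not sure it is correct) is to copy the remaining special tokens from the source tokenizer manually after cloning; a sketch with ungated tokenizers as stand-ins:

```python
from transformers import AutoTokenizer

# Stand-ins: target is the base tokenizer; in my case the source
# would be meta-llama/Llama-3.2-1B-Instruct (gated).
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
src = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Copy bos/pad, which clone_chat_template does not touch.
if src.bos_token is not None:
    tokenizer.bos_token = src.bos_token
tokenizer.pad_token = src.pad_token if src.pad_token is not None else tokenizer.eos_token

# Note: if a copied token is not in the target vocab, it would also need
# tokenizer.add_special_tokens(...) plus model.resize_token_embeddings(...).
print(tokenizer.pad_token is not None)
```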