Hi!
If I understood correctly, SWE-agent-LM-32B was trained on every assistant response (i.e., on every agent action).
For example, in this script SWE-smith/swesmith/train/run/ft_unsloth.py:
```python
...
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)
...
```
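If I read the Unsloth helper correctly, this masks everything outside the assistant segments to -100 in the labels, so the cross-entropy loss ends up being computed on every assistant token, good or bad.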
Have you experimented with excluding the agent's "bad" actions from the loss function?
For example, repeated attempts to fix the same piece of code, or actions that lead to an error. Roughly speaking, I would like to keep these actions in the context, but not include their tokens in the CE loss.
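Concretely, I mean something like the sketch below. It is just an illustration, not anything from SWE-smith or Unsloth: `build_masked_labels` is a made-up helper, and the `is_bad` flags would come from my own heuristics (repeated edits, actions that returned an error, etc.).

```python
import torch

IGNORE_INDEX = -100  # positions with this label are ignored by the HF/torch CE loss

def build_masked_labels(turns, tokenizer):
    """turns: list of dicts like {"role": "user"|"assistant", "text": str, "is_bad": bool}."""
    input_ids, labels = [], []
    for turn in turns:
        # Re-create the same chat markers the trainer splits on.
        ids = tokenizer.encode(
            f"<|im_start|>{turn['role']}\n{turn['text']}<|im_end|>\n",
            add_special_tokens=False,
        )
        input_ids.extend(ids)
        # Keep every turn in the context, but only let "good" assistant
        # turns contribute to the loss; all other tokens are masked out.
        if turn["role"] == "assistant" and not turn.get("is_bad", False):
            labels.extend(ids)
        else:
            labels.extend([IGNORE_INDEX] * len(ids))
    return torch.tensor(input_ids), torch.tensor(labels)
```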