Description
When submitting the exact same patches repeatedly via `sb-cli submit swe-bench-m test`, we are seeing large run-to-run variance in the number of Failed instances: the best run we have observed had only 4 failures, while the worst had over 100. Since the patches are identical across runs, this variance is directly affecting our overall resolved score.
Do you have any guidance on this?
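For context, this is roughly how we compare two runs to spot flaky instances. It is a minimal sketch that assumes each run's report is a JSON object mapping instance IDs to a record with a `status` field; those field names are illustrative, not the actual sb-cli report schema, so adjust the keys to match the real output:

```python
import json
import sys


def failed_ids(report_path: str) -> set[str]:
    """Collect instance IDs marked as failed in one run's report.

    Assumes a JSON object of the form {instance_id: {"status": ...}};
    the key names are placeholders for the actual sb-cli report format.
    """
    with open(report_path) as f:
        report = json.load(f)
    return {iid for iid, rec in report.items() if rec.get("status") == "failed"}


def main() -> None:
    # Usage: python compare_runs.py report_a.json report_b.json
    run_a, run_b = failed_ids(sys.argv[1]), failed_ids(sys.argv[2])
    print(f"failed in run A: {len(run_a)}, failed in run B: {len(run_b)}")
    print(f"failed in both runs: {len(run_a & run_b)}")
    # Instances failing in only one of the two runs are the flaky ones.
    print(f"flaky (failed in exactly one run): {sorted(run_a ^ run_b)}")


if __name__ == "__main__":
    main()
```

The flaky set is what dominates the variance for us; the instances that fail in both runs look like genuine patch failures.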