Description
When submitting the exact same patches repeatedly via `sb-cli submit swe-bench-m test`, we are seeing large run-to-run variance in the number of Failed instances: the best run we have observed had only 4 failures, while the worst had over 100. Since the patches are identical across runs, this variance is directly affecting our overall resolved score.
Do you have any guidance on this?
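For context, this is roughly how we compare two runs to spot flaky instances. It is a minimal sketch that assumes each run's report is a JSON object mapping instance IDs to a record with a `status` field; those field names are illustrative, not the actual sb-cli report schema, so adjust the keys to match the real output:

```python
import json
import sys


def failed_ids(report_path: str) -> set[str]:
    """Collect instance IDs marked as failed in one run's report.

    Assumes a JSON object of the form {instance_id: {"status": ...}};
    the key names are placeholders for the actual sb-cli report format.
    """
    with open(report_path) as f:
        report = json.load(f)
    return {iid for iid, rec in report.items() if rec.get("status") == "failed"}


def main() -> None:
    # Usage: python compare_runs.py report_a.json report_b.json
    run_a, run_b = failed_ids(sys.argv[1]), failed_ids(sys.argv[2])
    print(f"failed in run A: {len(run_a)}, failed in run B: {len(run_b)}")
    print(f"failed in both runs: {len(run_a & run_b)}")
    # Instances failing in only one of the two runs are the flaky ones.
    print(f"flaky (failed in exactly one run): {sorted(run_a ^ run_b)}")


if __name__ == "__main__":
    main()
```

The flaky set is what dominates the variance for us; the instances that fail in both runs look like genuine patch failures.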