Fix #24: Fix eval script to look for explicit passes #95
Conversation
This seems like a good change to make. Do there need to be other changes to the tests to test the grading logic changes?
Thanks for the review @acrmp! Yeah, I can add a couple more tests tomorrow! I also ended up making the decision to preserve the f2p speedup logic, but it can only be enabled via inclusion of an `--f2p_only` flag. That said, we can evolve the logic further if this flag is something that only takes effect for certain repos.
That's great. Thanks for the changes.
Context
In the existing design of SWE-smith, we make several assumptions about a test's status if it doesn't show up in the standard output when the test suite is run before/after a patch is applied (baked into the `get_valid_report` logic):

- If a test shows up as `PASS` after the patch is applied, we assume the test was failing before (`FAIL_TO_PASS` test).
- If a test shows up as `FAIL` after the patch is applied, we assume the test was passing before (`PASS_TO_FAIL` test).
- If a test shows up as `FAIL` before the patch is applied and doesn't show up after the patch is applied, we assume the test is passing after (`FAIL_TO_PASS` test).
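For concreteness, here is a minimal sketch of how those assumptions behave; the function and variable names below are hypothetical stand-ins, not the actual SWE-smith implementation:

```python
# Illustrative sketch only: the old assumption-based inference for tests that are
# missing from the pre- or post-patch output (all names here are hypothetical).

def infer_statuses_old(pre_patch: dict[str, str], post_patch: dict[str, str]) -> dict[str, str]:
    """Each dict maps test name -> "PASS" or "FAIL"; a missing key means not reported."""
    verdicts: dict[str, str] = {}
    for test, status in post_patch.items():
        if test not in pre_patch:
            if status == "PASS":
                verdicts[test] = "FAIL_TO_PASS"  # assume it was failing before the patch
            elif status == "FAIL":
                verdicts[test] = "PASS_TO_FAIL"  # assume it was passing before the patch
    for test, status in pre_patch.items():
        if status == "FAIL" and test not in post_patch:
            verdicts[test] = "FAIL_TO_PASS"      # assume it is passing after the patch
    return verdicts
```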
Consequently, the current design also makes two key choices:

- `get_eval_tests_report`: if a test doesn't show up after the patch is applied, we automatically assume the test is `PASS`.
- In the `get_test_cmd` implementation, to make evaluation faster, we do not run the entire test suite. Instead, we only run the test files containing `FAIL_TO_PASS` tests. `PASS_TO_PASS` tests that fall outside of these files are automatically assumed to be passing, because of the `get_eval_tests_report` assumption above.

The main benefit of this design is that the evaluation is much faster: instead of running the full suite, we only run a fraction of the tests.
Problems with Current Design
However, #24 (along with some concurrent work) wisely points out that this design, which (1) assumes non-present tests are passing and (2) doesn't run the entire test suite, makes it possible to "reward hack" the evaluation.
Specifically, given a task instance, instead of writing a fix, a policy could add bug(s) that cause the test suite to fail to run at all, such that no tests or their statuses show up. The existing evaluation would mark this as correct.
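To make the failure mode concrete, here is a small illustrative snippet (with a hypothetical helper name, not the real evaluation code) showing why a patch that simply breaks the test run would have been scored as correct under the old PASS-by-default rule:

```python
# Illustrative sketch only: under the old rule, a test missing from the post-patch
# log was assumed to have passed (the helper name below is hypothetical).

def old_instance_resolved(f2p_tests: list[str], post_patch: dict[str, str]) -> bool:
    return all(post_patch.get(test, "PASS") == "PASS" for test in f2p_tests)

# A patch that makes the suite fail to run produces no per-test statuses at all...
empty_post_patch_log: dict[str, str] = {}
# ...yet every FAIL_TO_PASS test is counted as passing, so the instance looks resolved.
assert old_instance_resolved(["tests/test_fix.py::test_bug"], empty_post_patch_log)
```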
Proposed Fixes
Therefore, this PR makes the general choice to (1) remove the aforementioned assumptions and (2) look for an explicit indication that a test has passed.

The main changes are:

- In `get_valid_report`, the assumptions about a test's state if it doesn't show up are removed.
- In `get_eval_tests_report`, a test is only marked as `PASS` if it shows up in the post-patch execution logs. If it doesn't show up, it is assumed to be `FAIL`, not `PASS` like before (see the sketch after this list).
- In `get_test_cmd`, we remove the logic that only runs the files with `FAIL_TO_PASS` tests for evaluation; instead, the entire test suite is run (unless the `--f2p_only` flag is provided).
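Sketched below is the spirit of the new rule, again with hypothetical names rather than the exact SWE-smith code: a test counts as passing only if the post-patch log explicitly says so, and anything absent is treated as a failure.

```python
# Illustrative sketch only: a test now counts as passing only if the post-patch log
# explicitly reports PASS; anything absent defaults to FAIL (hypothetical names).

def new_instance_resolved(f2p_tests: list[str], post_patch: dict[str, str]) -> bool:
    return all(post_patch.get(test, "FAIL") == "PASS" for test in f2p_tests)

# The crash-the-suite "fix" from the previous example now scores as unresolved...
assert not new_instance_resolved(["tests/test_fix.py::test_bug"], {})
# ...while a genuine fix still resolves, because the test shows up as PASS in the logs.
assert new_instance_resolved(
    ["tests/test_fix.py::test_bug"],
    {"tests/test_fix.py::test_bug": "PASS"},
)
```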
Minor Changes
- Removed the `patch_is_None`, `patch_successfully_applied` fields as they weren't being used.
- Removed the `filter_irrelevant_tests` helper func in `get_eval_report`, since we're no longer running `FAIL_TO_PASS` tests' files exclusively.
- `max_workers` flag renamed to `workers`.
- Renamed `test_exts` to `exts`, and added a check for `exts` in `extract_entities`. This is to enable creating bugs only for specific file types for certain repositories (e.g. for Golang repos, if `exts = ['.go']`, the logic will ignore non-Go files).
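As a rough illustration of the `exts` check (the helper below is a hypothetical stand-in for the filtering added to `extract_entities`), restricting bug generation to particular file types might look like this:

```python
# Illustrative sketch only: skip files whose extension is not in `exts`
# (a hypothetical stand-in for the check added to extract_entities).
from pathlib import Path

def candidate_files(repo_root: str, exts: list[str] | None = None) -> list[Path]:
    files = [p for p in Path(repo_root).rglob("*") if p.is_file()]
    if exts:
        files = [p for p in files if p.suffix in exts]
    return files

# For a Golang repo, exts=['.go'] means non-Go files are ignored entirely.
go_files = candidate_files("path/to/golang-repo", exts=[".go"])
```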
These designs should ensure that SWE-smith's evaluation criteria are no longer prone to reward hacking.
The downside is that evaluation will now take longer (on average ~8-10s instead of 1-2s per instance). However, we can re-introduce such optimizations more carefully in future PRs.