Fix #24: Fix eval script to look for explicit passes #95
Conversation
This seems like a good change to make. Do there need to be other changes to the tests to test the grading logic changes?
Thanks for the review @acrmp! Yeah, I can add a couple more tests tomorrow! I also ended up making the decision to preserve the f2p speedup logic, but it can only be enabled via inclusion of an `--f2p_only` flag. That said, we can evolve the logic further if this flag is something that only takes effect for certain repos.
That's great. Thanks for the changes.
Context
In the existing design of SWE-smith, we make several assumptions about a test's status if it doesn't show up in the standard output when the test suite is run before/after a patch is applied (baked into the `get_valid_report` logic):

- If a test shows up as `PASS` after the patch is applied, we assume the test was failing before (`FAIL_TO_PASS` test).
- If a test shows up as `FAIL` after the patch is applied, we assume the test was passing before (`PASS_TO_FAIL` test).
- If a test shows up as `FAIL` before the patch is applied and doesn't show up after the patch is applied, we assume the test is passing after (`FAIL_TO_PASS` test).
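For concreteness, here is a minimal sketch of how those assumptions behave; the function and variable names below are hypothetical stand-ins, not the actual SWE-smith implementation:

```python
# Illustrative sketch only: the old assumption-based inference for tests that are
# missing from the pre- or post-patch output (all names here are hypothetical).

def infer_statuses_old(pre_patch: dict[str, str], post_patch: dict[str, str]) -> dict[str, str]:
    """Each dict maps test name -> "PASS" or "FAIL"; a missing key means not reported."""
    verdicts: dict[str, str] = {}
    for test, status in post_patch.items():
        if test not in pre_patch:
            if status == "PASS":
                verdicts[test] = "FAIL_TO_PASS"  # assume it was failing before the patch
            elif status == "FAIL":
                verdicts[test] = "PASS_TO_FAIL"  # assume it was passing before the patch
    for test, status in pre_patch.items():
        if status == "FAIL" and test not in post_patch:
            verdicts[test] = "FAIL_TO_PASS"      # assume it is passing after the patch
    return verdicts
```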
Consequently, the current design also makes two key choices:

- `get_eval_tests_report`: if a test doesn't show up after the patch is applied, we automatically assume the test is `PASS`.
- In the `get_test_cmd` implementation, to make evaluation faster, we do not run the entire test suite. Instead, we only run the test files containing `FAIL_TO_PASS` tests. `PASS_TO_PASS` tests that fall outside of these files are automatically assumed to be passing, because of the `get_eval_tests_report` assumption above.

The main benefit of this design is that the evaluation is much faster: instead of running the full suite, we only run a fraction of the tests.
Problems with Current Design
However, #24 (along with some concurrent work) wisely points out that this design, which (1) assumes non-present tests are passing and (2) doesn't run the entire test suite, makes it possible to "reward hack" the evaluation.
Specifically, given a task instance, instead of writing a fix, a policy could add bug(s) that cause the test suite to fail to run at all, such that no tests or their statuses show up. The existing evaluation would mark this as correct.
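To make the failure mode concrete, here is a small illustrative snippet (with a hypothetical helper name, not the real evaluation code) showing why a patch that simply breaks the test run would have been scored as correct under the old PASS-by-default rule:

```python
# Illustrative sketch only: under the old rule, a test missing from the post-patch
# log was assumed to have passed (the helper name below is hypothetical).

def old_instance_resolved(f2p_tests: list[str], post_patch: dict[str, str]) -> bool:
    return all(post_patch.get(test, "PASS") == "PASS" for test in f2p_tests)

# A patch that makes the suite fail to run produces no per-test statuses at all...
empty_post_patch_log: dict[str, str] = {}
# ...yet every FAIL_TO_PASS test is counted as passing, so the instance looks resolved.
assert old_instance_resolved(["tests/test_fix.py::test_bug"], empty_post_patch_log)
```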
Proposed Fixes
Therefore, this PR makes the general choice to (1) remove the aforementioned assumptions and (2) look for an explicit indication that a test has passed.

The main changes are:

- In `get_valid_report`, the assumptions about a test's state if it doesn't show up are removed.
- In `get_eval_tests_report`, a test is only marked as `PASS` if it shows up in the post-patch execution logs. If it doesn't show up, it is assumed to be `FAIL`, not `PASS` like before (see the sketch after this list).
- In `get_test_cmd`, we remove the logic that only runs the files with `FAIL_TO_PASS` tests for evaluation; instead, the entire test suite is run (unless the `--f2p_only` flag is provided).
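Sketched below is the spirit of the new rule, again with hypothetical names rather than the exact SWE-smith code: a test counts as passing only if the post-patch log explicitly says so, and anything absent is treated as a failure.

```python
# Illustrative sketch only: a test now counts as passing only if the post-patch log
# explicitly reports PASS; anything absent defaults to FAIL (hypothetical names).

def new_instance_resolved(f2p_tests: list[str], post_patch: dict[str, str]) -> bool:
    return all(post_patch.get(test, "FAIL") == "PASS" for test in f2p_tests)

# The crash-the-suite "fix" from the previous example now scores as unresolved...
assert not new_instance_resolved(["tests/test_fix.py::test_bug"], {})
# ...while a genuine fix still resolves, because the test shows up as PASS in the logs.
assert new_instance_resolved(
    ["tests/test_fix.py::test_bug"],
    {"tests/test_fix.py::test_bug": "PASS"},
)
```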
Minor Changes
- Removed the `patch_is_None`, `patch_successfully_applied` fields as they weren't being used.
- Removed the `filter_irrelevant_tests` helper func in `get_eval_report`, since we're no longer running `FAIL_TO_PASS` tests' files exclusively.
- `max_workers` flag renamed to `workers`.
- Renamed `test_exts` to `exts`, and added a check for `exts` in `extract_entities`. This is to enable creating bugs only for specific file types for certain repositories (e.g. for Golang repos, if `exts = ['.go']`, the logic will ignore non-Go files).
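As a rough illustration of the `exts` check (the helper below is a hypothetical stand-in for the filtering added to `extract_entities`), restricting bug generation to particular file types might look like this:

```python
# Illustrative sketch only: skip files whose extension is not in `exts`
# (a hypothetical stand-in for the check added to extract_entities).
from pathlib import Path

def candidate_files(repo_root: str, exts: list[str] | None = None) -> list[Path]:
    files = [p for p in Path(repo_root).rglob("*") if p.is_file()]
    if exts:
        files = [p for p in files if p.suffix in exts]
    return files

# For a Golang repo, exts=['.go'] means non-Go files are ignored entirely.
go_files = candidate_files("path/to/golang-repo", exts=[".go"])
```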
These designs should ensure that SWE-smith's evaluation criteria are no longer prone to reward hacking.
The downside is that evaluation will now take longer (on average ~8-10s instead of 1-2s per instance). However, we can re-introduce such optimizations more carefully in future PRs.