Fix #24: Fix eval script to look for explicit passes #95


Merged
merged 13 commits into main on Jul 7, 2025

Conversation

john-b-yang
Member

@john-b-yang john-b-yang commented Jul 5, 2025

Context

In the existing design of SWE-smith, we make several assumptions about a test's status when it doesn't appear in the standard output from running the test suite before/after a patch is applied (baked into the get_valid_report logic):

  • If a test doesn't show up before the patch is applied, and is PASS after the patch is applied, we assume the test was failing before (FAIL_TO_PASS test)
  • If a test doesn't show up before the patch is applied, and is FAIL after the patch is applied, we assume the test was passing before (PASS_TO_FAIL test)
  • If a test is FAIL before the patch is applied, and doesn't show up after the patch is applied, we assume the test is passing after (FAIL_TO_PASS test)

Consequently, the current design also makes two key choices:

  • In get_eval_tests_report: If a test doesn't show up after the patch is applied, we automatically assume the test is PASS.
  • In the get_test_cmd implementation, to make evaluation faster, we do not run the entire test suite. Instead, we only run the test files containing FAIL_TO_PASS tests.
    • As a result, any PASS_TO_PASS tests that fall outside of these files are automatically assumed to be passing, because of the get_eval_tests_report behavior above.
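A minimal sketch of the old reporting behavior, assuming the post-patch log is a simple test-name-to-status mapping (names are hypothetical, not the actual get_eval_tests_report signature):

```python
def old_eval_report(expected: list[str], post_log: dict[str, str]) -> dict:
    """Illustrative sketch of the old report logic (simplified)."""
    report = {"PASS": [], "FAIL": []}
    for test in expected:
        # A test absent from the post-patch log was assumed to PASS.
        status = post_log.get(test, "PASS")
        report[status].append(test)
    return report
```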

The main benefit of this design is that the evaluation is much faster - instead of running the full suite, we only run a fraction of the tests.


Problems with Current Design

However, #24 (along with some concurrent work) wisely points out that this design, which (1) assumes non-present tests are passing and (2) doesn't run the entire test suite, makes it possible to "reward hack" the evaluation.

Specifically, given a task instance, instead of writing a fix, a policy could introduce bug(s) that cause the test suite to fail to run at all, so that no tests or their statuses show up. The existing evaluation would mark this as correct.


Proposed Fixes

Therefore, this PR makes the general choice to (1) remove the aforementioned assumptions and (2) look for an explicit indication that a test has passed.
The main changes are:

  • In get_valid_report, the assumptions about a test's state if it doesn't show up are removed.
  • In get_eval_tests_report, a test is only marked as PASS if it shows up in the post-patch execution logs. If it doesn't show up, it is assumed to be FAIL, not PASS like before.
  • In get_test_cmd, we remove logic that only runs the files with FAIL_TO_PASS tests for evaluation - instead, the entire test suite is run.
  • The logic for only running FAIL_TO_PASS test files for evaluation is preserved, but it is not enabled by default (it only runs when the --f2p_only flag is provided).
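The core of the fix can be sketched as follows, again with illustrative names: a test is only credited as passing when the post-patch log explicitly reports it as PASS.

```python
def new_eval_report(expected: list[str], post_log: dict[str, str]) -> dict:
    """Sketch of the fixed behavior (not the exact SWE-smith code)."""
    report = {"PASS": [], "FAIL": []}
    for test in expected:
        # Missing tests now default to FAIL, closing the reward-hacking hole:
        # a patch that breaks the whole suite produces no PASS entries.
        status = "PASS" if post_log.get(test) == "PASS" else "FAIL"
        report[status].append(test)
    return report
```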

Minor Changes

  • Remove patch_is_None, patch_successfully_applied fields as they weren't being used.
  • Remove the filter_irrelevant_tests helper function in get_eval_report, since we're no longer running FAIL_TO_PASS tests' files exclusively.
  • Rename the max_workers flag to workers.
  • Rename test_exts to exts, and add a check for exts in extract_entities. This enables creating bugs only for specific file types in certain repositories (e.g. for Golang repos, if exts = ['.go'], the logic will ignore non-Go files).

These changes should ensure that SWE-smith's evaluation is no longer prone to reward hacking.

The downside is that evaluation will now take longer (on average ~8-10s instead of 1-2s per instance). However, we can re-introduce such optimizations more carefully in future PRs.


codecov bot commented Jul 5, 2025

Codecov Report

Attention: Patch coverage is 82.02247% with 16 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| swesmith/harness/grading.py | 65.51% | 10 Missing ⚠️ |
| swesmith/profiles/base.py | 77.77% | 4 Missing ⚠️ |
| swesmith/profiles/python.py | 50.00% | 2 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| swesmith/bug_gen/adapters/__init__.py | 100.00% <ø> (ø) |
| swesmith/profiles/golang.py | 99.49% <100.00%> (-0.01%) ⬇️ |
| tests/harness/test_grading.py | 100.00% <100.00%> (ø) |
| tests/profiles/test_base.py | 96.25% <100.00%> (-0.13%) ⬇️ |
| tests/profiles/test_profiles_golang.py | 100.00% <100.00%> (ø) |
| swesmith/profiles/python.py | 94.45% <50.00%> (+0.12%) ⬆️ |
| swesmith/profiles/base.py | 88.29% <77.77%> (-1.87%) ⬇️ |
| swesmith/harness/grading.py | 79.16% <65.51%> (+4.93%) ⬆️ |
@john-b-yang john-b-yang requested review from klieret and acrmp July 5, 2025 22:36
Collaborator

@acrmp acrmp left a comment


This seems like a good change to make. Do there need to be other changes to the tests to test the grading logic changes?

@john-b-yang
Member Author

Thanks for the review @acrmp! Yeah, I can add a couple more tests tomorrow!

I also ended up deciding to preserve the f2p speedup logic, but it can only be enabled by passing the --f2p_only flag. I think the speedup is quite reasonable for most repositories, and in some situations the speedup at the expense of less complete evaluation is worth it, particularly for training.

That said, we can evolve the logic further if this flag should only take effect for certain repos.

@john-b-yang
Member Author

john-b-yang commented Jul 7, 2025

Ok just added a couple more tests for grading.py!

@acrmp @klieret @Broyojo @MarcCote if you happen to have time, would love a review! But no pressure, I think the fixes should address the key original issues - will merge within the next 2 hours.

@john-b-yang john-b-yang merged commit fd89801 into main Jul 7, 2025
7 checks passed
@MarcCote
Collaborator

MarcCote commented Jul 7, 2025

That's great. Thanks for the changes.
