Hello SWE-bench leaderboard community, thanks for your continued enthusiasm in using SWE-bench!
We've recently been cleaning up the leaderboard after a long delay. Apologies to everyone who had to wait, and thanks for your patience!
Based on the lessons we've learned, going forward, you can expect two changes in SWE-bench:
- First, we will be much more proactive about merging submissions. Your submission will either be merged (if there are no issues) or commented on within 1 week. If you're waiting longer than that, please ping @john-b-yang or @ofirpress.
- Second, we've recently updated the `README.md` and `checklist.md` to be clearer about the submission process. We encourage you to go through them briefly when you have time.
We'd like to remind the community of several goals we have when it comes to how we curate the leaderboard.
- We want to be accepting of all submissions.
- We want submissions to be a meaningful contribution to the development of AI for SWE.
- We want submissions to be as transparent and informative as possible about how the underlying system works.
- We do not want our leaderboard to become an advertisement for new products. Please see ProductHunt if that is your goal.
Generally, we believe the large majority of submissions have met this standard!
To continue this good trend, we'd like to highlight the following submission requirement:
Note
For your submission, please provide a description of your system (arXiv paper, technical report, blog post), or a link to it. Also add it as `info / report` in `metadata.yaml`.
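As a rough pre-submission self-check, the sketch below tests whether already-parsed metadata carries a report link. It assumes that "info / report" means a `report` key nested under an `info` mapping, which this post does not spell out; the function name `has_report_link` and the example URL are hypothetical, and nothing here is the maintainers' actual validation.

```python
from urllib.parse import urlparse


def has_report_link(metadata: dict) -> bool:
    """Return True if metadata["info"]["report"] looks like an http(s) URL.

    Assumption: the report link lives at info -> report. Adjust the
    lookup if the real metadata.yaml layout differs.
    """
    info = metadata.get("info")
    if not isinstance(info, dict):
        return False
    report = info.get("report")
    if not isinstance(report, str):
        return False
    parsed = urlparse(report.strip())
    # Require both a web scheme and a host so that empty strings or
    # placeholder text such as "TBD" are rejected.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


# Hypothetical metadata, as it might look after loading metadata.yaml
example = {"info": {"report": "https://example.com/technical-report"}}
print(has_report_link(example))
```

The function takes a plain dictionary so it stays independent of whichever YAML parser you use to load the file.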
The most time-consuming aspect of leaderboard curation is dealing with submissions that provide little to no meaningful information about how their system works. Please refer to the following as examples of good reports:
- SWE-agent, code
- OpenHands, code
- Agentless, code
- Weights & Biases
- Nvidia CORTEXA
- Anthropic Claude 3.5 Sonnet
Going forward, please continue to respect this requirement.
- Generally, academics and industry research teams have provided papers, system cards, or technical blogs that satisfy these requirements.
- Most of our struggle has come from startups. If you are a startup team, please know that we welcome your submissions, but you must, must provide an informative technical blog, particularly if your code is closed-source.