This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)).

**UPDATE (6/15/2025): We now support running SWE-bench-Live evaluation (see the paper [here](https://arxiv.org/abs/2505.23419))! For how to run it, check out [this README](./SWE-bench-Live.md).**

**UPDATE (5/26/2025): We now support running interactive SWE-Bench evaluation (see the paper [here](https://arxiv.org/abs/2502.13069))! For how to run it, check out [this README](./SWE-Interact.md).**

**UPDATE (4/8/2025): We now support running SWT-Bench evaluation! For more details, check out [the corresponding section](#SWT-Bench-Evaluation).**
## SWE-bench-Live Evaluation

SWE-bench-Live is a live benchmark for issue resolution that provides a dataset of the latest issue tasks. This section explains how to run the OpenHands evaluation on SWE-bench-Live.

Since SWE-bench-Live uses an almost identical setup to SWE-bench, you only need to change the dataset name to `SWE-bench-Live/SWE-bench-Live`; everything else works the same as running on SWE-bench.
## Setting Up
Set up the development environment and configure your LLM provider by following the [README](README.md).
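The LLM config name you pass to the run script below (`llm.your_llm`) should match a config group in your `config.toml`; here is a minimal sketch, with the group name, model, and key shown only as placeholder assumptions:

```toml
# Hypothetical example: rename "your_llm" to whatever you pass to run_infer.sh,
# and substitute your own model and API key.
[llm.your_llm]
model = "anthropic/claude-3-5-sonnet-20241022"
api_key = "your-api-key"
temperature = 0.0
```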
## Running Inference
Use the same script, but change the dataset name to `SWE-bench-Live/SWE-bench-Live` and select the split (either `lite` or `full`). The lite split contains 300 instances from the past six months, while the full split includes 1,319 instances created after 2024. In the original SWE-bench-Live paper, `max_iterations` is set to 100.
```shell
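# Arguments: <llm config> <git ref> <agent> <eval limit> <max iterations> <num workers> <dataset> <split>
# (same positional usage as the standard SWE-bench run_infer.sh)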
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.your_llm HEAD CodeActAgent 300 100 3 SWE-bench-Live/SWE-bench-Live lite
```
## Evaluating Results
After OpenHands has generated a patch for each issue, we evaluate the results using the [SWE-bench-Live evaluation harness](https://github.com/microsoft/SWE-bench-Live).

Convert the outputs to the SWE-bench predictions format; you can find `output.jsonl` in `evaluation/evaluation_outputs`.
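As a minimal sketch of the conversion (assuming each line of `output.jsonl` keeps the generated patch under `test_result.git_patch`, and that the harness expects `instance_id`, `model_name_or_path`, and `model_patch` keys), you could use `jq`:

```shell
# Sketch only: the input field names are assumptions about the OpenHands output format.
jq -c '{instance_id: .instance_id,
        model_name_or_path: "openhands",
        model_patch: .test_result.git_patch}' \
  path/to/output.jsonl > preds.jsonl
```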
Please refer to the original [SWE-bench-Live repository](https://github.com/microsoft/SWE-bench-Live) to set up the evaluation harness and use the provided scripts to generate the evaluation report:
```shell
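# --namespace below presumably points at the Docker Hub namespace hosting prebuilt SWE-bench-Live instance images.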
python -m swebench.harness.run_evaluation \
    --dataset_name SWE-bench-Live/SWE-bench-Live \
    --split lite \
    --namespace starryzhang \
    --predictions_path preds.jsonl \
    --max_workers 10 \
    --run_id openhands
```
## Citation
```bibtex
@article{zhang2025swebenchgoeslive,
  title={SWE-bench Goes Live!},
  author={Linghao Zhang and Shilin He and Chaoyun Zhang and Yu Kang and Bowen Li and Chengxing Xie and Junhao Wang and Maoquan Wang and Yufan Huang and Shengyu Fu and Elsie Nallipogu and Qingwei Lin and Yingnong Dang and Saravan Rajmohan and Dongmei Zhang},
  journal={arXiv preprint arXiv:2505.23419},
  year={2025}
}
```