TestEval: Benchmarking Large Language Models for Test Case Generation
Wenhan Wang1*, Chenyuan Yang2*, Zhijie Wang1*, Yuheng Huang3, Zhaoyang Chu4,
Da Song1, Lingming Zhang2, An Ran Chen1, Lei Ma3,1
1University of Alberta, 2University of Illinois Urbana-Champaign,
3The University of Tokyo, 4Huazhong University of Science and Technology
TestEval Leaderboard
How to interpret the results?
Overall coverage denotes the line/branch coverage achieved by the full set of N generated test cases.
Coverage@k denotes the line/branch coverage achieved using only k of the N generated test cases (see the sketch after this list).
Target line/branch/path coverage denotes the accuracy of covering a specific line/branch/path when the model is explicitly instructed to do so.
Baseline denotes the accuracy of covering a specific line/branch/path without such an instruction.
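To make these definitions concrete, the sketch below shows one way the coverage metrics could be computed with the coverage.py library. It is only an illustration under our own assumptions, not the official TestEval evaluation harness: the function name measure_coverage, the program path solution.py, and the way test cases are executed as self-contained snippets are all hypothetical.

```python
# Minimal sketch (not the official TestEval harness) of measuring the
# coverage achieved by a set of LLM-generated test cases.
import io
import coverage

def measure_coverage(program_path: str, test_cases: list[str]) -> float:
    """Execute each generated test case and return the coverage (%)
    achieved on the program under test."""
    # Pass branch=True to coverage.Coverage(...) to measure branch coverage instead.
    cov = coverage.Coverage(include=[program_path])
    cov.start()
    for case in test_cases:
        try:
            # Each test case is assumed to be a self-contained snippet that
            # imports the program under test and exercises it on some input.
            exec(compile(case, "<generated_test>", "exec"), {})
        except Exception:
            # A crashing test case simply contributes no additional coverage.
            pass
    cov.stop()
    # report() returns the total coverage percentage for the included file.
    return cov.report(file=io.StringIO())

# Overall coverage: run all N generated test cases.
# overall = measure_coverage("solution.py", all_n_cases)
# Coverage@k: run only k of the N generated test cases.
# at_k = measure_coverage("solution.py", all_n_cases[:k])
```

Target line/branch/path coverage and the baseline are measured analogously, except that success is judged per query: whether the specified line, branch, or path is actually covered by the generated test.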
"Size" here is the amount of activated model weight during
inference.
The leaderboard distinguishes two levels of openness: models with open weights and open data, and models with open weights and open SFT data whose base model is not data-open. Why does this matter? Models that open-source their data allow one to concretely reason about potential benchmark contamination.
BibTeX
@inproceedings{wang2025testeval,
  title     = {TESTEVAL: Benchmarking Large Language Models for Test Case Generation},
  author    = {Wenhan Wang and Chenyuan Yang and Zhijie Wang and Yuheng Huang and Zhaoyang Chu and Da Song and Lingming Zhang and An Ran Chen and Lei Ma},
  booktitle = {Findings of the Association for Computational Linguistics: NAACL 2025},
  year      = {2025}
}