Welcome to the official repository for our survey paper:
“Surveying the Benchmarking Landscape of Large Language Models in Code Intelligence”
The rapid evolution of Large Language Models (LLMs) such as GPT-2, GPT-3, and their successors has transformed the field of code intelligence, enabling significant advances in tasks such as code generation, program repair, software testing, and debugging.
Benchmarking plays a crucial role in ensuring these models are evaluated rigorously and meaningfully.
In this work, we systematically review:
- 142 research papers
- 156 unique benchmark datasets
- 32 different code-related tasks
We analyze each dataset across four key dimensions:
- General landscape and coverage
- Dataset construction and quality assurance
- Evaluation protocols
- Limitations and gaps
Key findings from our analysis:
- Python is the dominant language (used in 77% of datasets)
- GitHub is the primary data source (46% usage)
- Most benchmarks target code generation (86 datasets)
- Benchmark creation has accelerated notably over the past three years
- Gaps remain around bias, dataset evolution, and standardized evaluation (see the illustrative sketch below)
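As a point of reference for the evaluation protocols discussed in the survey: many code-generation benchmarks report pass@k, the probability that at least one of k sampled completions passes the unit tests. This metric comes from the broader literature (Chen et al., 2021), not from the survey itself. A minimal Python sketch of the standard unbiased estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total generated samples per problem
    c: number of those samples that pass the unit tests
    k: sampling budget being evaluated
    Returns the probability that at least one of k samples is correct.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so every k-subset
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 of which pass the tests
print(pass_at_k(200, 37, 1))   # ≈ 0.185
print(pass_at_k(200, 37, 10))  # ≈ 0.88
```

In practice the estimator is averaged over all problems in a benchmark to obtain the reported pass@k score.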
You can read the full survey here: 📖 https://hal.science/hal-05183398
If you find this work useful in your research, please consider citing it:
```bibtex
@article{abdollahi:hal-05183398,
  TITLE       = {{Surveying the Benchmarking Landscape of Large Language Models in Code Intelligence}},
  AUTHOR      = {Abdollahi, Mohammad and Zhang, Ruixin and Shiri Harzevili, Nima and Shin, Jiho and Wang, Song and Hemmati, Hadi},
  URL         = {https://hal.science/hal-05183398},
  NOTE        = {37 pages + references},
  YEAR        = {2025},
  MONTH       = Jul,
  KEYWORDS    = {Large language Models LLMs ; Benchmark ; Code Intelligence ; Software Engineering},
  PDF         = {https://hal.science/hal-05183398v1/file/main.pdf},
  HAL_ID      = {hal-05183398},
  HAL_VERSION = {v1},
}
```