Welcome to the official repository for our survey paper:
“Surveying the Benchmarking Landscape of Large Language Models in Code Intelligence”
The rapid evolution of Large Language Models (LLMs) such as GPT-2, GPT-3, and their successors has transformed the field of code intelligence, enabling significant advances in tasks such as code generation, program repair, software testing, and debugging.
Benchmarking plays a crucial role in ensuring these models are evaluated rigorously and meaningfully.
In this work, we systematically review:
- 142 research papers
- 156 unique benchmark datasets
- 32 different code-related tasks
We analyze each dataset across four key dimensions:
- General landscape and coverage
- Dataset construction and quality assurance
- Evaluation protocols
- Limitations and gaps
Key findings from our analysis:
- Python is the dominant language (used in 77% of datasets)
- GitHub is the primary data source (46% usage)
- Most benchmarks target code generation (86 datasets)
- Benchmark creation has accelerated notably over the past three years
- Gaps remain around bias, dataset evolution, and standardized evaluation (see the illustrative sketch below)
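As a point of reference for the evaluation protocols discussed in the survey: many code-generation benchmarks report pass@k, the probability that at least one of k sampled completions passes the unit tests. This metric comes from the broader literature (Chen et al., 2021), not from the survey itself. A minimal Python sketch of the standard unbiased estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total generated samples per problem
    c: number of those samples that pass the unit tests
    k: sampling budget being evaluated
    Returns the probability that at least one of k samples is correct.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so every k-subset
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 of which pass the tests
print(pass_at_k(200, 37, 1))   # ≈ 0.185
print(pass_at_k(200, 37, 10))  # ≈ 0.88
```

In practice the estimator is averaged over all problems in a benchmark to obtain the reported pass@k score.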
You can read the full survey here: 📖 https://hal.science/hal-05183398
If you find this work useful in your research, please consider citing it:
```bibtex
@article{abdollahi:hal-05183398,
  TITLE       = {{Surveying the Benchmarking Landscape of Large Language Models in Code Intelligence}},
  AUTHOR      = {Abdollahi, Mohammad and Zhang, Ruixin and Shiri Harzevili, Nima and Shin, Jiho and Wang, Song and Hemmati, Hadi},
  URL         = {https://hal.science/hal-05183398},
  NOTE        = {37 pages + references},
  YEAR        = {2025},
  MONTH       = Jul,
  KEYWORDS    = {Large language Models LLMs ; Benchmark ; Code Intelligence ; Software Engineering},
  PDF         = {https://hal.science/hal-05183398v1/file/main.pdf},
  HAL_ID      = {hal-05183398},
  HAL_VERSION = {v1},
}
```