Official implementation of our paper "FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding" ([paper](https://arxiv.org/abs/2509.22237)).
The rapid advancement of Large Language Models (LLMs) has given rise to a novel software development paradigm known as 'vibe coding,' where users interact with coding agents through high-level natural language. However, existing code-generation benchmarks inadequately assess an agent's vibe coding capabilities: they either require code-level specifications or focus narrowly on issue solving, neglecting the critical scenario of feature implementation within the vibe coding paradigm. To address this gap, we propose FeatBench, a novel benchmark for vibe coding that focuses on feature implementation. Our benchmark is distinguished by several key features: ❶ Pure Natural Language Prompts. Task inputs consist solely of abstract natural language descriptions, devoid of any code or structural hints. ❷ A Rigorous & Evolving Data Collection Process. FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark, mitigating data contamination. ❸ Comprehensive Test Cases. Each task includes Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests to verify correctness and prevent regressions. ❹ Diverse Application Domains. The benchmark includes repositories from diverse domains to ensure it reflects real-world scenarios. We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate reaching only 29.94%. Our analysis also reveals a tendency toward "aggressive implementation," a strategy that paradoxically leads to both critical failures and superior software design. We release FeatBench, our automated collection pipeline, and all experimental results to facilitate further community research.
- Pure natural-language prompts – task inputs contain only abstract user-facing descriptions with no code snippets or signature hints, mirroring vibe coding interactions.
- Release-grounded corpus – each instance originates from a curated GitHub release and pull-request history, yielding high-signal requirements and verified reference patches.
- Rigorous, evolving pipeline – a multi-stage, fully automated collection system applies quality filters, mitigates data contamination, and can roll forward continuously as new releases ship.
- Comprehensive regression checks – Fail-to-Pass (F2P) and Pass-to-Pass (P2P) pytest selections ensure both new behaviour and legacy functionality are validated.
- Diverse domains – 27 actively maintained repositories spanning AI/ML, DevOps, web platforms, and productivity tools provide broad coverage of real-world tech stacks.
The FeatBench dataset contains the following key attributes for each evaluation instance:
- repo: Repository name in the format `owner/name` (e.g., "home-assistant/core")
- instance_id: Unique identifier combining repository name and issue/PR number (e.g., "home-assistant__core-153575")
- org: GitHub organization or user name
- number: Associated issue or pull request number
- version: Version tag of the release containing this feature
- base_commit: Git commit hash of the base version before the feature implementation
- created_at: Timestamp when the feature was released (ISO 8601 format)
- patch: Array of source code modifications, where each entry contains:
  - filename: Path to the modified file
  - status: Modification status (typically "modified")
  - additions: Number of lines added
  - deletions: Number of lines deleted
  - changes: Total number of changes (additions + deletions)
  - patch: Unified diff showing the actual code changes
- test_patch: Array of test file modifications with the same structure as `patch`
  - Contains additions, deletions, and unified diffs for test files
  - Used to validate both FAIL_TO_PASS and PASS_TO_PASS test cases
- problem_statement: Human-readable description of the feature to implement
- test_files: List of test file paths that validate this feature
- processed: Boolean flag indicating whether the instance has been validated
- FAIL_TO_PASS: Tests that should pass after implementing the feature
- PASS_TO_PASS: Tests that should continue passing (regression checks)
- docker_image: Name of the prebuilt Docker image, in the form `featbench_repo:number`

An example instance:
```json
{
"repo": "home-assistant/core",
"instance_id": "home-assistant__core-153575",
"base_commit": "3f9421ab0801a339e62506c0c123066c53810efb",
"patch": [...],
"test_patch": [...],
"problem_statement": "I want to ensure that when my Z-Wave adapter...",
"created_at": "2025-10-03T16:39:14Z",
"version": "2025.10.1",
"org": "home-assistant",
"number": 153575,
"test_files": ["tests/components/zwave_js/test_config_flow.py"],
"processed": true
}
```

To run FeatBench you will need:

- System Python with `uv` (used to install `trae-agent`).
- Python 3.10 or later (3.13 recommended to match the configured containers).
- Docker Engine 24+.
- Recent Git installation for repository cloning inside containers.
- Access tokens:
  - A GitHub personal access token with `repo` and `read:org` permissions.
  - An LLM provider key (OpenAI-compatible) for PR summarisation.
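A quick way to confirm the core prerequisites are in place (these are standard version checks, not FeatBench-specific commands):

```bash
# Verify the toolchain before installing FeatBench
docker --version    # expect Docker Engine 24 or newer
git --version       # any recent Git release
python3 --version   # 3.10+, ideally 3.13
uv --version        # used later to install trae-agent
```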
```bash
git clone https://github.com/Kndy666/FeatBench.git
cd FeatBench
conda create -n FeatBench python=3.13 -y
conda activate FeatBench
pip install -r requirements.txt
pip install -e .
```

FeatBench operates in three main phases: Data Collection, Environment Building, and Evaluation. Each phase has its own configuration and requirements.
The data collection system mines real feature releases from GitHub to generate evaluation datasets.
The process includes four stages:
- Repository Collection (`release_collector.py`): Mines GitHub for repositories based on stars and release count
- Release Analysis (`release_analyzer.py`): Analyzes release content to identify new features
- PR Enhancement (`pr_analyzer.py`): Enriches tasks with PR-level diffs and LLM-generated task descriptions
- Output Generation (`main.py`): Orchestrates all stages to produce `final_analysis_results.json`
First, create a `.secrets.toml` file in the `data_collect` directory and configure the following:
```toml
# data_collect/.secrets.toml
[common]
github_token = "ghp_xxx"  # GitHub Personal Access Token with 'repo' and 'read:org' permissions
openai_api_key = "xxx"    # OpenAI-compatible API key
```

Modify the settings in `data_collect/config.toml` as needed.
```bash
cd FeatBench
python -m data_collect.main
```

The script supports several optional command-line arguments to customize the execution:
- `--no-cache`: Do not use cached data; reprocess all repositories and analyses from scratch.
- `--collect-only`: Perform only the repository collection stage and skip subsequent release analysis and PR enhancement.
- `--analyze-only`: Perform only the release analysis stage, assuming repository collection has already been done.
- `--enhance-only`: Perform only the PR enhancement stage, assuming previous stages are complete.
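For example, to redo just the PR enhancement stage without reusing cached results (assuming the flags can be combined as described above):

```bash
# Re-run only the PR-enhancement stage and ignore any cached intermediate data
python -m data_collect.main --enhance-only --no-cache
```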
Build Docker container environments to prepare evaluation infrastructure.
The program stores temporary files in the `docker_agent/swap/` subdirectory under the directory it is run from:
- Contains `trae-agent` clones and configuration files
- Creates independent container images for each repository
- Note: First run may require several GB of space, depending on the number of repositories processed
On the first run, the program clones `trae-agent` into the `docker_agent/swap/trae-agent/` directory and exits.
You then need to configure it inside that directory:
```bash
cd docker_agent/swap/trae-agent
cp trae_config.yaml.example trae_config.yaml
# Edit trae_config.yaml as needed
```

Modify configurations in `docker_agent/settings.toml` as needed:
- Logging configuration (`level`, `log_file`): Adjust log level and output location
- Execution configuration (`max_specs_per_repo`): Limit the maximum number of specifications per repository
- Docker configuration (`docker_timeout`): Container operation timeout (default: 180 seconds)
- Proxy configuration (`proxy_enabled`, `proxy_http`, `proxy_https`): Set these if operating behind a proxy
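A minimal sketch of what these settings might look like in `docker_agent/settings.toml`; the key names come from the list above, but the section grouping and example values are assumptions, so check them against the file shipped in the repository:

```toml
# Illustrative only -- section names and values are assumptions, not the shipped defaults
[logging]
level = "INFO"
log_file = "docker_agent.log"

[execution]
max_specs_per_repo = 5

[docker]
docker_timeout = 180  # seconds, per the default noted above

[proxy]
proxy_enabled = false
proxy_http = "http://127.0.0.1:7890"
proxy_https = "http://127.0.0.1:7890"
```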
```bash
cd FeatBench
python -m docker_agent.runner.main --agents your_agent
```

Run agents in isolated Docker containers to implement features.
First, transform the data from the collection phase:
```bash
cd FeatBench
python -m docker_agent.tools.main
```

Alternatively, you can use the preprocessed dataset file `dataset/featbench_v1_0.json` (156 curated instances used in the original paper).
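If you want to inspect the curated instances before running anything, a small Python sketch like the following should work, assuming the file is a JSON array of instance objects with the fields described earlier (adjust the parsing if the actual layout differs):

```python
import json
from collections import Counter

# Load the preprocessed dataset shipped with the repository (assumed to be a JSON array).
with open("dataset/featbench_v1_0.json", encoding="utf-8") as f:
    instances = json.load(f)

print(f"Loaded {len(instances)} instances")

# Count instances per repository and show a few identifiers.
repo_counts = Counter(inst["repo"] for inst in instances)
for repo, count in repo_counts.most_common(5):
    print(f"{repo}: {count} instances")

for inst in instances[:3]:
    print(inst["instance_id"], "-", len(inst.get("FAIL_TO_PASS", [])), "F2P tests")
```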
The codebase defaults to supporting trae-agent evaluation.
Then run the evaluator:
```bash
python -m docker_agent.runner.main --evaluate --agents your_agent
```

To evaluate other agents or models, follow these three steps:
Step 1: Create a new file in the `docker_agent/agents/` directory that defines a class inheriting from `BaseAgent` in `base.py`:
```python
# docker_agent/agents/your_agent.py
from __future__ import annotations  # lets the Spec annotation resolve lazily

from typing import Any, Dict, List, Optional

from docker_agent.agents.base import BaseAgent
# Spec is the task-specification type used by the runner; import it from the
# module where the project defines it (see trae_agent.py for the exact path).


class YourAgent(BaseAgent):
    def _prepare_agent_code(self):
        # Prepare agent code
        pass

    def prepare_resources(self, patch: str) -> Optional[List[Dict[str, Any]]]:
        # Prepare agent-specific resources before evaluation
        pass

    def evaluate(self, spec: Spec, operator, *args, **kwargs) -> Dict[str, Any]:
        # Evaluate the agent on a specific spec
        pass
```

Refer to `docker_agent/agents/trae_agent.py` for the detailed implementation.
Step 2: Modify the _create_agent method in docker_agent/agents/manager.py
```python
def _create_agent(self, agent_name: str, config: dict) -> BaseAgent:
    if agent_name == "trae_agent":
        return TraeAgent(config)
    elif agent_name == "your_agent":
        return YourAgent(config)
    else:
        ...
```

Step 3: Update the `docker_agent/agents.toml` configuration file:
```toml
# docker_agent/agents.toml
[your_agent]
name = "Your Agent"
model = "gpt-4"
provider = "openai"
install_commands = [
    "pip install your-agent-package"
]
repo_url = "https://github.com/your/agent"
branch = "main"
```

The evaluation process generates the following key files:
- `final_analysis_results*.json`: Curated evaluation summaries
- `evaluation_results_file.json`: Agent evaluation results
- `docker_agent/swap/`: Temporary working directory (can be safely deleted)
Each evaluation instance produces a result object with the following structure:
```json
{
"agent": "trae-agent",
"model": "deepseek-chat",
"instance_id": "instructlab__instructlab-3286",
"success_f2p": false,
"success_p2p": false,
"success": false,
"passed_f2p_tests": [],
"passed_p2p_tests": [],
"total_tokens": 542776,
"patch_application": {
"total_files_num": 1,
"applied_files_num": 1,
"applied_files": [
"dspy/primitives/tool.py"
],
"patch_content": "diff --git a/dspy/primitives/tool.py ..."
}
}
```

- agent: Name of the evaluated agent (e.g., "trae-agent", "your_agent")
- model: Underlying LLM model used by the agent
- instance_id: Unique identifier matching the input instance
- success_f2p: Whether all FAIL_TO_PASS tests pass
- success_p2p: Whether all PASS_TO_PASS tests pass
- success: Overall success (true if both F2P and P2P succeed)
- passed_f2p_tests: List of FAIL_TO_PASS tests that passed
- passed_p2p_tests: List of PASS_TO_PASS tests that passed
- total_tokens: Total tokens consumed during the evaluation
- patch_application: Details about the generated patch
- total_files_num: Total number of files in the patch
- applied_files_num: Number of files successfully applied
- applied_files: List of files that were successfully applied
- patch_content: Unified diff of the generated changes
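As a quick illustration of how these fields can be consumed, the sketch below computes per-model success rates from the results file; it assumes `evaluation_results_file.json` holds a JSON array of result objects like the one shown above (adapt the loading code if the file is laid out differently):

```python
import json
from collections import defaultdict

# Assumed layout: a JSON array of per-instance result objects.
with open("evaluation_results_file.json", encoding="utf-8") as f:
    results = json.load(f)

totals = defaultdict(int)
solved = defaultdict(int)

for r in results:
    key = (r["agent"], r["model"])
    totals[key] += 1
    # "success" is true only when both the F2P and P2P test sets pass.
    solved[key] += int(r["success"])

for (agent, model), n in sorted(totals.items()):
    rate = 100.0 * solved[(agent, model)] / n
    print(f"{agent} / {model}: {solved[(agent, model)]}/{n} resolved ({rate:.2f}%)")
```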
This project is licensed under the MIT License.
If you use FeatBench in your research, please cite our paper:
```bibtex
@misc{chen2025featbenchevaluatingcodingagents,
title={FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding},
author={Haorui Chen and Chengze Li and Jia Li},
year={2025},
eprint={2509.22237},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.22237},
}
```

If you have any questions or suggestions, please email us at [email protected] or feel free to open an issue.


