Official implementation of our paper "FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding" ([paper](https://arxiv.org/abs/2509.22237)).
The rapid advancement of Large Language Models (LLMs) has given rise to a novel software development paradigm known as 'vibe coding,' where users interact with coding agents through high-level natural language. However, existing code-generation benchmarks inadequately assess an agent's vibe coding capabilities: they either require code-level specifications or focus narrowly on issue solving, neglecting the critical scenario of feature implementation within the vibe coding paradigm. To address this gap, we propose FeatBench, a novel benchmark for vibe coding that focuses on feature implementation. Our benchmark is distinguished by several key features: ❶ Pure Natural Language Prompts. Task inputs consist solely of abstract natural language descriptions, devoid of any code or structural hints. ❷ A Rigorous & Evolving Data Collection Process. FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark, mitigating data contamination. ❸ Comprehensive Test Cases. Each task includes Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests to verify correctness and prevent regressions. ❹ Diverse Application Domains. The benchmark includes repositories from diverse domains to ensure it reflects real-world scenarios. We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate reaching only 29.94%. Our analysis also reveals a tendency toward "aggressive implementation," a strategy that paradoxically leads to both critical failures and superior software design. We release FeatBench, our automated collection pipeline, and all experimental results to facilitate further community research.
- Pure natural-language prompts – task inputs contain only abstract user-facing descriptions with no code snippets or signature hints, mirroring vibe coding interactions.
- Release-grounded corpus – each instance originates from a curated GitHub release and pull-request history, yielding high-signal requirements and verified reference patches.
- Rigorous, evolving pipeline – a multi-stage, fully automated collection system applies quality filters, mitigates data contamination, and can roll forward continuously as new releases ship.
- Comprehensive regression checks – Fail-to-Pass (F2P) and Pass-to-Pass (P2P) pytest selections ensure both new behaviour and legacy functionality are validated.
- Diverse domains – 27 actively maintained repositories spanning AI/ML, DevOps, web platforms, and productivity tools provide broad coverage of real-world tech stacks.
The FeatBench dataset contains the following key attributes for each evaluation instance:
- repo: Repository name in the format `owner/name` (e.g., "home-assistant/core")
- instance_id: Unique identifier combining repository name and issue/PR number (e.g., "home-assistant__core-153575")
- org: GitHub organization or user name
- number: Associated issue or pull request number
- version: Version tag of the release containing this feature
- base_commit: Git commit hash of the base version before the feature implementation
- created_at: Timestamp when the feature was released (ISO 8601 format)
- patch: Array of source code modifications, where each entry contains:
  - filename: Path to the modified file
  - status: Modification status (typically "modified")
  - additions: Number of lines added
  - deletions: Number of lines deleted
  - changes: Total number of changes (additions + deletions)
  - patch: Unified diff showing the actual code changes
- test_patch: Array of test file modifications with the same structure as `patch`
  - Contains additions, deletions, and unified diffs for test files
  - Used to validate both FAIL_TO_PASS and PASS_TO_PASS test cases
- problem_statement: Human-readable description of the feature to implement
- test_files: List of test file paths that validate this feature
- processed: Boolean flag indicating whether the instance has been validated
- FAIL_TO_PASS: Tests that should pass after implementing the feature
- PASS_TO_PASS: Tests that should continue passing (regression checks)
- docker_image: Name of the prebuilt Docker image, in the form `featbench_repo:number`

An example instance:
```json
{
"repo": "home-assistant/core",
"instance_id": "home-assistant__core-153575",
"base_commit": "3f9421ab0801a339e62506c0c123066c53810efb",
"patch": [...],
"test_patch": [...],
"problem_statement": "I want to ensure that when my Z-Wave adapter...",
"created_at": "2025-10-03T16:39:14Z",
"version": "2025.10.1",
"org": "home-assistant",
"number": 153575,
"test_files": ["tests/components/zwave_js/test_config_flow.py"],
"processed": true
}
```

To run FeatBench you will need:

- System Python with `uv` (used to install `trae-agent`).
- Python 3.10 or later (3.13 recommended to match the configured containers).
- Docker Engine 24+.
- Recent Git installation for repository cloning inside containers.
- Access tokens:
  - A GitHub personal access token with `repo` and `read:org` permissions.
  - An LLM provider key (OpenAI-compatible) for PR summarisation.
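A quick way to confirm the core prerequisites are in place (these are standard version checks, not FeatBench-specific commands):

```bash
# Verify the toolchain before installing FeatBench
docker --version    # expect Docker Engine 24 or newer
git --version       # any recent Git release
python3 --version   # 3.10+, ideally 3.13
uv --version        # used later to install trae-agent
```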
```bash
git clone https://github.com/Kndy666/FeatBench.git
cd FeatBench
conda create -n FeatBench python=3.13 -y
conda activate FeatBench
pip install -r requirements.txt
pip install -e .
```

FeatBench operates in three main phases: Data Collection, Environment Building, and Evaluation. Each phase has its own configuration and requirements.
The data collection system mines real feature releases from GitHub to generate evaluation datasets.
The process includes four stages:
- Repository Collection (`release_collector.py`): Mines GitHub for repositories based on stars and release count
- Release Analysis (`release_analyzer.py`): Analyzes release content to identify new features
- PR Enhancement (`pr_analyzer.py`): Enriches tasks with PR-level diffs and LLM-generated task descriptions
- Output Generation (`main.py`): Orchestrates all stages to produce `final_analysis_results.json`
First, create a `.secrets.toml` file in the `data_collect` directory and configure the following:
```toml
# data_collect/.secrets.toml
[common]
github_token = "ghp_xxx"  # GitHub Personal Access Token with 'repo' and 'read:org' permissions
openai_api_key = "xxx"    # OpenAI-compatible API key
```

Modify the settings in `data_collect/config.toml` as needed.
```bash
cd FeatBench
python -m data_collect.main
```

The script supports several optional command-line arguments to customize the execution:
- `--no-cache`: Do not use cached data; reprocess all repositories and analyses from scratch.
- `--collect-only`: Perform only the repository collection stage and skip subsequent release analysis and PR enhancement.
- `--analyze-only`: Perform only the release analysis stage, assuming repository collection has already been done.
- `--enhance-only`: Perform only the PR enhancement stage, assuming previous stages are complete.
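For example, to redo just the PR enhancement stage without reusing cached results (assuming the flags can be combined as described above):

```bash
# Re-run only the PR-enhancement stage and ignore any cached intermediate data
python -m data_collect.main --enhance-only --no-cache
```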
Build Docker container environments to prepare evaluation infrastructure.
The program stores temporary files in the `docker_agent/swap/` subdirectory under the directory it is run from:
- Contains `trae-agent` clones and configuration files
- Creates independent container images for each repository
- Note: First run may require several GB of space, depending on the number of repositories processed
On the first run, the program clones `trae-agent` into the `docker_agent/swap/trae-agent/` directory and exits.
You then need to configure it inside that directory:
```bash
cd docker_agent/swap/trae-agent
cp trae_config.yaml.example trae_config.yaml
# Edit trae_config.yaml as needed
```

Modify configurations in `docker_agent/settings.toml` as needed:
- Logging configuration (`level`, `log_file`): Adjust log level and output location
- Execution configuration (`max_specs_per_repo`): Limit the maximum number of specifications per repository
- Docker configuration (`docker_timeout`): Container operation timeout (default: 180 seconds)
- Proxy configuration (`proxy_enabled`, `proxy_http`, `proxy_https`): Set these if operating behind a proxy
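A minimal sketch of what these settings might look like in `docker_agent/settings.toml`; the key names come from the list above, but the section grouping and example values are assumptions, so check them against the file shipped in the repository:

```toml
# Illustrative only -- section names and values are assumptions, not the shipped defaults
[logging]
level = "INFO"
log_file = "docker_agent.log"

[execution]
max_specs_per_repo = 5

[docker]
docker_timeout = 180  # seconds, per the default noted above

[proxy]
proxy_enabled = false
proxy_http = "http://127.0.0.1:7890"
proxy_https = "http://127.0.0.1:7890"
```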
```bash
cd FeatBench
python -m docker_agent.runner.main --agents your_agent
```

Run agents in isolated Docker containers to implement features.
First, transform the data from the collection phase:
```bash
cd FeatBench
python -m docker_agent.tools.main
```

Alternatively, you can use the preprocessed dataset file `dataset/featbench_v1_0.json` (156 curated instances used in the original paper).
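If you want to inspect the curated instances before running anything, a small Python sketch like the following should work, assuming the file is a JSON array of instance objects with the fields described earlier (adjust the parsing if the actual layout differs):

```python
import json
from collections import Counter

# Load the preprocessed dataset shipped with the repository (assumed to be a JSON array).
with open("dataset/featbench_v1_0.json", encoding="utf-8") as f:
    instances = json.load(f)

print(f"Loaded {len(instances)} instances")

# Count instances per repository and show a few identifiers.
repo_counts = Counter(inst["repo"] for inst in instances)
for repo, count in repo_counts.most_common(5):
    print(f"{repo}: {count} instances")

for inst in instances[:3]:
    print(inst["instance_id"], "-", len(inst.get("FAIL_TO_PASS", [])), "F2P tests")
```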
The codebase defaults to supporting trae-agent evaluation.
Then run the evaluator:
```bash
python -m docker_agent.runner.main --evaluate --agents your_agent
```

To evaluate other agents or models, follow these three steps:
Step 1: Create a new file in the `docker_agent/agents/` directory that defines a class inheriting from `BaseAgent` in `base.py`:
```python
# docker_agent/agents/your_agent.py
from __future__ import annotations  # lets the Spec annotation resolve lazily

from typing import Any, Dict, List, Optional

from docker_agent.agents.base import BaseAgent
# Spec is the task-specification type used by the runner; import it from the
# module where the project defines it (see trae_agent.py for the exact path).


class YourAgent(BaseAgent):
    def _prepare_agent_code(self):
        # Prepare agent code
        pass

    def prepare_resources(self, patch: str) -> Optional[List[Dict[str, Any]]]:
        # Prepare agent-specific resources before evaluation
        pass

    def evaluate(self, spec: Spec, operator, *args, **kwargs) -> Dict[str, Any]:
        # Evaluate the agent on a specific spec
        pass
```

Refer to `docker_agent/agents/trae_agent.py` for the detailed implementation.
Step 2: Modify the _create_agent method in docker_agent/agents/manager.py
```python
def _create_agent(self, agent_name: str, config: dict) -> BaseAgent:
    if agent_name == "trae_agent":
        return TraeAgent(config)
    elif agent_name == "your_agent":
        return YourAgent(config)
    else:
        ...
```

Step 3: Update the `docker_agent/agents.toml` configuration file:
```toml
# docker_agent/agents.toml
[your_agent]
name = "Your Agent"
model = "gpt-4"
provider = "openai"
install_commands = [
    "pip install your-agent-package"
]
repo_url = "https://github.com/your/agent"
branch = "main"
```

The evaluation process generates the following key files:
- `final_analysis_results*.json`: Curated evaluation summaries
- `evaluation_results_file.json`: Agent evaluation results
- `docker_agent/swap/`: Temporary working directory (can be safely deleted)
Each evaluation instance produces a result object with the following structure:
```json
{
"agent": "trae-agent",
"model": "deepseek-chat",
"instance_id": "instructlab__instructlab-3286",
"success_f2p": false,
"success_p2p": false,
"success": false,
"passed_f2p_tests": [],
"passed_p2p_tests": [],
"total_tokens": 542776,
"patch_application": {
"total_files_num": 1,
"applied_files_num": 1,
"applied_files": [
"dspy/primitives/tool.py"
],
"patch_content": "diff --git a/dspy/primitives/tool.py ..."
}
}
```

- agent: Name of the evaluated agent (e.g., "trae-agent", "your_agent")
- model: Underlying LLM model used by the agent
- instance_id: Unique identifier matching the input instance
- success_f2p: Whether all FAIL_TO_PASS tests pass
- success_p2p: Whether all PASS_TO_PASS tests pass
- success: Overall success (true if both F2P and P2P succeed)
- passed_f2p_tests: List of FAIL_TO_PASS tests that passed
- passed_p2p_tests: List of PASS_TO_PASS tests that passed
- total_tokens: Total tokens consumed during the evaluation
- patch_application: Details about the generated patch
- total_files_num: Total number of files in the patch
- applied_files_num: Number of files successfully applied
- applied_files: List of files that were successfully applied
- patch_content: Unified diff of the generated changes
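As a quick illustration of how these fields can be consumed, the sketch below computes per-model success rates from the results file; it assumes `evaluation_results_file.json` holds a JSON array of result objects like the one shown above (adapt the loading code if the file is laid out differently):

```python
import json
from collections import defaultdict

# Assumed layout: a JSON array of per-instance result objects.
with open("evaluation_results_file.json", encoding="utf-8") as f:
    results = json.load(f)

totals = defaultdict(int)
solved = defaultdict(int)

for r in results:
    key = (r["agent"], r["model"])
    totals[key] += 1
    # "success" is true only when both the F2P and P2P test sets pass.
    solved[key] += int(r["success"])

for (agent, model), n in sorted(totals.items()):
    rate = 100.0 * solved[(agent, model)] / n
    print(f"{agent} / {model}: {solved[(agent, model)]}/{n} resolved ({rate:.2f}%)")
```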
This project is licensed under the MIT License.
If you use FeatBench in your research, please cite our paper:
```bibtex
@misc{chen2025featbenchevaluatingcodingagents,
title={FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding},
author={Haorui Chen and Chengze Li and Jia Li},
year={2025},
eprint={2509.22237},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.22237},
}
```

If you have any questions or suggestions, please email us at [email protected] or feel free to open an issue.


