FormalBench: A Comprehensive Evaluation of LLMs on Formal Specification Inference

This repository contains evaluation infrastructure for FormalBench including evaluation metrics and wrappers for calling LLMs. If you found this repository to be useful, please cite our research paper:

@article{le2025can,
  title={Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference},
  author={Le-Cong, Thanh and Le, Bach and Murray, Toby},
  booktitle={The 63rd Annual Meeting of the Association for Computational Linguistics},
  year={2025},
  url={https://openreview.net/forum?id=zXyiakJYzB}
}

Prerequisites

FormalBench have two modes:

Docker mode, which requires the docker in your local computers
Non-Docker mode, which requires Ubuntu 20.04

Versions:

Python 3.12
Pip 25.0
Java 11
Clang-12

Installation

Step 1: Clone FormalBench to local computers:

git clone https://github.com/thanhlecongg/FormalBench.git

Step 2: Create virtual environment with Python (default version of FormalBench is Python 3.12)

python3 -m venv .env
. .env/bin/activate

Step 3: Install FormalBench

pip3 install -e .[default]

Step 4: Setup API keys: To use LLMs and langsmith (optional), please setup the enviroment file at FormalBench/config/.env. See the following for an example:

OPENAI_API_KEY=<OpenAI API Key>
DEEPSEEK_API_KEY = <DeepSeek API Key>
ANTHROPIC_API_KEY = <Anthropic API Key>
HF_TOKEN=<HunggingFace Access Token>
### Optional
LANGSMITH_TRACING=true
LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
LANGSMITH_API_KEY=<LangSmith API key>
LANGSMITH_PROJECT=<LangSmith Projet Name>

Supported Features

Language: Currently, we support:
- Java with its specification language, JML
- C with its specification language, ACSL
Verification tools: Currently, we provide wrapper to two tools:
- OpenJML tool version 17 & 21 to verify Java programs annotated by JML
- Frama-C tool version 26 (Iron) to verify C programs annotated by ACSL
Evaluation metrics:
- Consistency metrics: Verification success rate & failure rate
- Completeness metrics: We implemented our completeness metric based mutation analysis:
  - For Java, we use Major framework
  - For C, we use Mull framework
LLM-based specification inference:
- workflow:
  - basic: Generate spec and verify it using verfication tool
  - only_gen: Only generate spec
- prompt_type: zero shot, few shot, least-to-most, chain-of-thought
- LLMs: DeepSeekV3, GPT-4o, GPT3.5, o3-mini, o1-mini, Claude3.5-Sonnet, CodeQwen2.5 (QwenCoder-32B), CodeQwen1.5-7B, CodeLLama-34B, DeepSeekV2-32B

Basic Usage

Please refers to demo folder and demo/USAGE.MD for detailed instructions and examples

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
.github/workflows		.github/workflows
FormalBench		FormalBench
demo		demo
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

FormalBench: A Comprehensive Evaluation of LLMs on Formal Specification Inference

Prerequisites

Installation

Supported Features

Basic Usage

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

thanhlecongg/FormalBench

Folders and files

Latest commit

History

Repository files navigation

FormalBench: A Comprehensive Evaluation of LLMs on Formal Specification Inference

Prerequisites

Installation

Supported Features

Basic Usage

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages