
# MedRECT: A Medical Reasoning Benchmark for Error Correction in Clinical Texts


This repository contains the official implementation and datasets for MedRECT, the first cross-lingual benchmark (Japanese/English) for medical error detection and correction in clinical texts.

## Table of Contents

  1. Overview
  2. Installation
  3. Quick Start
  4. Configuration
  5. Datasets
  6. Citation
  7. License

## Overview

Large language models (LLMs) show increasing promise in medical applications, but their ability to detect and correct errors in clinical texts—a prerequisite for safe deployment—remains under-evaluated, particularly beyond English. MedRECT addresses this gap by providing:

- Cross-lingual evaluation (Japanese/English) for medical error correction
- Three progressive subtasks, illustrated in the sketch below: Error Detection, Error Sentence Extraction, and Error Correction
- A scalable automated pipeline for creating high-quality benchmarks from medical licensing examinations
- Diverse error types, including diagnosis, monitoring/management, physical findings, procedures, and medication
- Comprehensive evaluation of 9 contemporary LLMs spanning proprietary, open-weight, and reasoning families
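
As a rough illustration of how the three subtasks relate, each benchmark item can be thought of as a clinical text paired with an error flag, the position of the erroneous sentence (if any), and a corrected sentence. The field names below are hypothetical and shown only to clarify the task structure; the actual schema is defined by the dataset files in this repository.

```python
# Hypothetical item structure: field names are illustrative, not the repository's actual schema.
example_item = {
    "text": "Sentence 0. Sentence 1 (contains a factual error). Sentence 2.",
    "has_error": True,               # Subtask 1: error detection (binary flag)
    "error_sentence_id": 1,          # Subtask 2: error sentence extraction (index of the flawed sentence)
    "corrected_sentence": "Sentence 1, rewritten without the error.",  # Subtask 3: error correction
}
```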

## Installation

### Prerequisites

- Python 3.11
- uv package manager

### Basic Installation

```bash
# Clone the repository
git clone https://github.com/pfnet-research/medrect.git
cd medrect

# Install dependencies (basic: API evaluation + viewer)
uv sync

# For local LLM inference
uv sync --group vllm
```

### Environment Setup

```bash
# Copy environment template
cp .env.example .env

# Edit .env with your API keys
# Required for evaluation:
# - OPENROUTER_API_KEY (for OpenAI models)
# - AZURE_OPENAI_API_KEY (for Azure OpenAI)
```
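
Before launching an evaluation, it can help to confirm that the keys are actually visible in your environment. This is a minimal sketch using only the standard library and the two key names listed in the template above; trim the list to the providers you actually use.

```python
import os

# Key names taken from .env.example above; adjust to the providers you use.
REQUIRED_KEYS = ["OPENROUTER_API_KEY", "AZURE_OPENAI_API_KEY"]

missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
if missing:
    raise SystemExit(f"Missing API keys in environment: {', '.join(missing)}")
print("All expected API keys are set.")
```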

### BLEURT Setup (Optional)

The `error_correction_full` metric requires the BLEURT-20 checkpoint. Download the checkpoint and configure its path in `configs/tasks/medec.yaml`. See the Task Configuration Guide for detailed setup instructions.
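
The following is a minimal download sketch. It assumes the publicly hosted BLEURT-20 archive published by the BLEURT authors and a local `checkpoints/` directory; verify the URL, the archive size (roughly 2 GB), and the path that `configs/tasks/medec.yaml` expects before relying on it.

```python
import urllib.request
import zipfile
from pathlib import Path

# Assumed public checkpoint URL from the BLEURT project; verify before use.
BLEURT_URL = "https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip"
target_dir = Path("checkpoints")  # hypothetical location; point configs/tasks/medec.yaml at the result
target_dir.mkdir(parents=True, exist_ok=True)

archive_path = target_dir / "BLEURT-20.zip"
urllib.request.urlretrieve(BLEURT_URL, archive_path)  # large download

with zipfile.ZipFile(archive_path) as zf:
    zf.extractall(target_dir)  # yields checkpoints/BLEURT-20/

print(f"BLEURT-20 checkpoint extracted to {target_dir / 'BLEURT-20'}")
```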

## Quick Start

### Evaluating Models on MedRECT

```bash
# Evaluate a model on MedRECT-ja (Japanese)
uv run --env-file .env batch \
  --task medec \
  --models gpt-4_1 \
  --datasets medrect_ja \
  --templates 0_shot_ja

# Evaluate on MedRECT-en (English)
uv run --env-file .env batch \
  --task medec \
  --models gpt-4_1 \
  --datasets medrect_en \
  --templates 0_shot_en
```

### Using Configuration Files

```bash
# Run evaluation with a config file (Japanese)
uv run --env-file .env batch \
  --config configs/batch/medrect_ja_all.yaml

# Run evaluation with a config file (English)
uv run --env-file .env batch \
  --config configs/batch/medrect_en_all.yaml
```

### Viewing Results

After running evaluations, you can inspect the results with the interactive viewer:

```bash
# Launch the interactive results viewer
uv run viewer
```

The viewer displays evaluation results (`*raw_responses.json`, `*predictions.json`) from the `results/` directory. Run the evaluation commands above to generate results; they will then appear automatically in the viewer.
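
If you want to inspect results outside the viewer, a schema-agnostic sketch like the one below simply lists the generated prediction files and reports their top-level structure (the exact JSON schema is not documented in this README, so nothing beyond the top level is assumed).

```python
import json
from pathlib import Path

# Scan the results/ directory for the prediction files the viewer reads.
for path in sorted(Path("results").rglob("*predictions.json")):
    with path.open(encoding="utf-8") as f:
        data = json.load(f)
    # The schema is not specified here, so report only the top-level shape.
    size = len(data) if isinstance(data, (list, dict)) else "n/a"
    print(f"{path}: {type(data).__name__} with {size} entries")
```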

## Configuration

The evaluation system uses a three-level configuration hierarchy under the `configs/` directory:

- `configs/tasks/`: Define evaluation tasks, datasets, and metrics
  - Specify which datasets to use (MedRECT-ja, MedRECT-en, etc.)
  - Configure evaluation metrics (error detection, sentence extraction, error correction)
- `configs/models/`: Configure LLM models and inference parameters
  - API-based models (OpenAI, Azure, OpenRouter, etc.)
  - Local models (vLLM)
- `configs/batch/`: Orchestrate evaluation experiments
  - Combine tasks, models, datasets, and templates
  - Run multiple evaluations in parallel

See the README.md in each subdirectory for detailed configuration instructions.
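
To inspect an existing batch configuration without running anything, a minimal, schema-agnostic sketch can load the YAML and print its top-level keys. This assumes PyYAML is available in the project environment and uses one of the config paths shown in Quick Start; the real schema is described in `configs/batch/README.md`.

```python
import yaml  # PyYAML; assumed to be available in the project environment
from pathlib import Path

config_path = Path("configs/batch/medrect_ja_all.yaml")
with config_path.open(encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Report only the top-level structure; see configs/batch/README.md for the actual schema.
if not isinstance(config, dict):
    raise SystemExit(f"Unexpected top-level YAML type: {type(config).__name__}")
for key, value in config.items():
    print(f"{key}: {type(value).__name__}")
```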

## Datasets

### Benchmark Data

| Dataset | Samples | Error % | Correct % | Source |
|---|---|---|---|---|
| MedRECT-ja | 663 | 55.4% | 44.6% | JMLE 2024-2025 |
| MedRECT-en | 458 | 53.1% | 46.9% | MEDEC MS Subset Test |

To reproduce the MedRECT-ja benchmark dataset construction from JMLE questions, see the Dataset Construction Pipeline documentation.

### Training Data

| Dataset | Samples | Error % | Correct % | Source |
|---|---|---|---|---|
| MedRECT-ja-train | 5,538 | 65.2% | 34.8% | JMLE 2018-2023 |
| MedRECT-en-train | 2,439 | 51.0% | 49.0% | MEDEC MS Subset Training & Validation |

To construct training data with reasoning, see the Training Data Construction documentation.

## Citation

If you use MedRECT in your research, please cite our paper:

```bibtex
@article{iwase2025medrect,
  title={MedRECT: A Medical Reasoning Benchmark for Error Correction in Clinical Texts},
  author={Iwase, Naoto and Okuyama, Hiroki and Iwasawa, Junichiro},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}
```

## License

### Source Code

This project is licensed under the MIT License; see the LICENSE file for details.

### Datasets

Datasets in `data/` are licensed under CC-BY-4.0.

Our MedRECT datasets are derived from the following sources:

- JMLE (Japanese Medical Licensing Examinations)
  - Source: Ministry of Health, Labour and Welfare (厚生労働省)
  - License: PDL-1.0 (compatible with CC-BY-4.0)
  - Used for: MedRECT-ja
- MEDEC MS Subset
  - Source: Ben Abacha, A., Yim, W., Fu, Y., Sun, Z., Yetisgen, M., Xia, F., & Lin, T. (2024). MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes. arXiv preprint arXiv:2412.19260.
  - Repository: https://github.com/abachaa/MEDEC
  - License: CC-BY-4.0
  - Used for: MedRECT-en

## Acknowledgments

We are grateful to:

- The creators of MEDEC for providing the foundational methodology and dataset
- The Ministry of Health, Labour and Welfare (厚生労働省) for making the Japanese Medical Licensing Examination data publicly available
