- [2025-04-23] Our paper is released on arXiv.
- [2025-04-24] The synthetic verification CoTs used to train ThinkPRM are now on Hugging Face.
- [2025-04-25] Our trained PRMs are released in two sizes: 1.5B and 14B, finetuned from R1-Distill-Qwen models.
We introduce ThinkPRM, a collection of generative long-CoT process reward models. Our verifiers are obtained by finetuning reasoning models on 1K synthetic verification CoTs, filtered using only 8K process labels from PRM800K. The resulting verifiers outperform LLM-as-a-judge and discriminative PRMs on most in- and out-of-domain setups. ThinkPRM enables scaling up verifier compute either in parallel or sequentially, by thinking longer.
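As a rough illustration of how a generative verifier like ThinkPRM can be used at inference time, here is a minimal sketch with Hugging Face Transformers. The model ID, prompt template, and generation settings below are placeholders, not the official ones; refer to the released checkpoints and the paper for the exact setup.

```python
# Minimal sketch: score a solution by generating a verification CoT.
# The model ID and prompt format are placeholders, not the official ones.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/ThinkPRM-1.5B"  # placeholder; use the released checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

problem = "What is 12 * 7 + 5?"
solution_prefix = "Step 1: 12 * 7 = 84.\nStep 2: 84 + 5 = 89."

# Ask the verifier to critique each step and end each critique with a boxed judgment.
prompt = (
    f"Problem:\n{problem}\n\nSolution:\n{solution_prefix}\n\n"
    "Verify each step and end every step's critique with \\boxed{correct} or \\boxed{incorrect}."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
verification_cot = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(verification_cot)
```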
ThinkPRM was trained on a dataset of 1,000 high-quality synthetic verification chains-of-thought (CoTs) designed for training generative Process Reward Models (PRMs), as used in the paper "Process Reward Models That Think". The goal was to create a data-efficient alternative to traditional PRM training, which often requires extensive human annotation or expensive rollouts.
Each instance consists of a math problem, a corresponding multi-step solution prefix (sourced from PRM800K [Lightman et al., 2023]), and a detailed verification CoT generated by QwQ-32B-Preview. The verification CoT critiques each step of the solution prefix and provides a step-level correctness judgment (`\boxed{correct}` or `\boxed{incorrect}`).
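As an illustration of this data format, the sketch below loads the dataset and extracts the step-level judgments from a verification CoT. The dataset ID and field names are assumptions; check the Hugging Face dataset card for the actual schema.

```python
# Sketch: load the synthetic verification CoTs and pull out step-level judgments.
# Dataset ID and column names are assumptions; consult the dataset card for the real ones.
import re
from datasets import load_dataset

ds = load_dataset("your-org/thinkprm-verification-cots", split="train")  # placeholder ID

example = ds[0]
problem = example["problem"]                    # assumed field name
solution_prefix = example["solution"]           # assumed field name
verification_cot = example["verification_cot"]  # assumed field name

# Each step's critique ends with \boxed{correct} or \boxed{incorrect}.
step_labels = re.findall(r"\\boxed\{(correct|incorrect)\}", verification_cot)
print(step_labels)  # e.g. ['correct', 'correct', 'incorrect']
```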
To ensure high-quality synthetic CoTs, only chains whose step-level judgments all matched the ground-truth human annotations from PRM800K were retained. Chains were also filtered for correct formatting and length to avoid issues like excessive overthinking observed in unfiltered generations. The figure below summarizes the synthetic CoT collection process. Refer to our paper for more details on data collection.
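A rough sketch of this process-based filtering, assuming the gold PRM800K step labels are available as a simple list and using an arbitrary length cap; the exact formatting and length checks used in the paper may differ.

```python
# Sketch of process-based filtering: keep a verification CoT only if every
# step-level judgment matches the gold PRM800K label and basic format/length
# checks pass. The length cap and label conventions here are assumptions.
import re

MAX_COT_CHARS = 20_000  # assumed cap to avoid excessive overthinking

def keep_verification_cot(verification_cot: str, gold_step_labels: list[str]) -> bool:
    predicted = re.findall(r"\\boxed\{(correct|incorrect)\}", verification_cot)

    # Formatting check: exactly one judgment per annotated step.
    if len(predicted) != len(gold_step_labels):
        return False

    # Length check: discard overly long chains.
    if len(verification_cot) > MAX_COT_CHARS:
        return False

    # Process-based check: every step judgment must match the gold label.
    return all(p == g for p, g in zip(predicted, gold_step_labels))

# Example usage with a toy CoT and gold labels.
cot = "Step 1 looks right: \\boxed{correct}\nStep 2 has an error: \\boxed{incorrect}"
print(keep_verification_cot(cot, ["correct", "incorrect"]))  # True
```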
The dataset was created to enable efficient training of powerful generative PRMs. The core idea is that fine-tuning strong reasoning models on carefully curated, synthetic verification CoTs can yield verifiers that outperform models trained on much larger, traditionally labeled datasets. The process-based filtering (matching gold step labels) was shown to be crucial for generating high-quality training data compared to outcome-based filtering.
Code is coming soon.
If you find ThinkPRM helpful, please cite our paper:
@article{khalifa2025,
  title   = {Process Reward Models That Think},
  author  = {Muhammad Khalifa and Rishabh Agarwal and Lajanugen Logeswaran and Jaekyeom Kim and Hao Peng and Moontae Lee and Honglak Lee and Lu Wang},
  journal = {arXiv preprint arXiv:2504.16828},
  year    = {2025},
  url     = {https://arxiv.org/abs/2504.16828},
}
