
Process Reward Models That Think 🧠

🎉News

  • [2025-04-23] Our paper is released on arXiv.
  • [2025-04-24] The synthetic verification CoTs used to train ThinkPRM are now on Hugging Face.
  • [2025-04-25] Our trained PRMs are released in two sizes: 1.5B and 14B, finetuned from R1-Distill-Qwen models.

📖Introduction

We introduce ThinkPRM, a collection of generative long-CoT process reward models. Our verifiers are obtained by finetuning reasoning models on 1K synthetic verification CoTs, filtered using only 8K process labels from PRM800K. The resulting verifiers outperform LLM-as-a-judge and discriminative PRMs on most in- and out-of-domain setups. ThinkPRM enables scaling up verifier compute either in parallel or sequentially, by thinking longer.
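As a rough illustration of parallel verifier-compute scaling, several verification CoTs can be sampled for the same solution and their final verdicts aggregated. The function below is a minimal sketch assuming a simple majority vote; it is not the exact aggregation scheme from the paper.

```python
from collections import Counter

def parallel_verify(verdicts: list[str]) -> str:
    """Aggregate final verdicts from several independently sampled
    verification CoTs for the same solution by majority vote.
    Illustrative only; see the paper for the actual scheme."""
    return Counter(verdicts).most_common(1)[0][0]
```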


📀Data Collection

ThinkPRM was trained on synthetic verification CoTs. This dataset contains 1,000 high-quality synthetic verification chains-of-thought (CoTs) designed for training generative Process Reward Models (PRMs), as used in the paper "Process Reward Models That Think". The goal was to create a data-efficient alternative to traditional PRM training, which often requires extensive human annotation or expensive rollouts.

Each instance consists of a math problem, a corresponding multi-step solution prefix (sourced from PRM800K [Lightman et al., 2023]), and a detailed verification CoT generated by QwQ-32B-Preview. The verification CoT critiques each step of the solution prefix and provides a step-level correctness judgment (\boxed{correct} or \boxed{incorrect}).
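Given that format, the step-level judgments can be pulled out of a verification CoT in order of appearance. The helper below is a hypothetical sketch, not part of the released code:

```python
import re

def parse_step_judgments(verification_cot: str) -> list[bool]:
    """Extract per-step \\boxed{correct} / \\boxed{incorrect} judgments
    from a verification CoT, in the order they appear.
    Hypothetical helper; assumes the \\boxed{...} convention above."""
    labels = re.findall(r"\\boxed\{(correct|incorrect)\}", verification_cot)
    return [label == "correct" for label in labels]
```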

To ensure high-quality synthetic CoTs, only chains where all step-level judgments matched the ground-truth human annotations from the PRM800K dataset were retained. Chains were also filtered on correct formatting and length constraints, to avoid issues like the excessive overthinking observed in unfiltered generation. The figure below summarizes the synthetic CoT collection process. Refer to our paper for more details on data collection.
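A minimal sketch of that process-based filter, assuming hypothetical names and an illustrative token cap (the paper's exact constraints may differ):

```python
def keep_chain(predicted: list[bool], gold: list[bool],
               n_tokens: int, max_tokens: int = 4096) -> bool:
    """Process-based filter: keep a synthetic verification CoT only if
    every step-level judgment matches the gold PRM800K label and the
    chain respects a length cap (to curb overthinking).
    `max_tokens` is an illustrative default, not a value from the paper."""
    if len(predicted) != len(gold):
        return False  # malformed: wrong number of step judgments
    return predicted == gold and n_tokens <= max_tokens
```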

[Figure: overview of the synthetic verification CoT collection process]

The dataset was created to enable efficient training of powerful generative PRMs. The core idea is that fine-tuning strong reasoning models on carefully curated, synthetic verification CoTs can yield verifiers that outperform models trained on much larger, traditionally labeled datasets. The process-based filtering (matching gold step labels) was shown to be crucial for generating high-quality training data compared to outcome-based filtering.

✨Getting Started

Code is coming soon.

🎈Citation

If you find ThinkPRM helpful, please cite our paper.

@article{khalifa2025,
  title   = {Process Reward Models That Think},
  author  = {Muhammad Khalifa and Rishabh Agarwal and Lajanugen Logeswaran and Jaekyeom Kim and Hao Peng and Moontae Lee and Honglak Lee and Lu Wang},
  journal = {arXiv preprint arXiv:2504.16828},
  year    = {2025},
  url     = {https://arxiv.org/abs/2504.16828},
}
