minRewardModel

This repo contains code for

  • Process Reward Model (PRM): for math problem solving
  • Outcome Reward Model (ORM): for math problem solving
  • Bradley-Terry Model: for chat alignment

For PRM and ORM, the implementation is based on the idea in Let's Verify Step by Step.

For the Bradley-Terry model, the implementation follows the idea in InstructGPT.

Experiment Setup

I only have a single RTX 3090 with 24GB of memory, which is a hard constraint. Here are a few tricks to run the experiments within it (a minimal training-setup sketch follows the list):

  1. Base model: Qwen3-1.7B-Base, which is a relatively small model
  2. Use a batch size of 1 with 32 gradient accumulation steps to simulate an effective batch size of 32
  3. Use the 8-bit Adam optimizer to reduce the optimizer state memory
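The sketch below shows how these three tricks fit together in a training loop. It is an assumed outline, not the repo's exact code: the Hugging Face model id, the learning rate, and `train_loader` are placeholders.

```python
# Minimal sketch (assumed) of the memory-saving setup: a small base model,
# micro-batch size 1 with 32 gradient accumulation steps, and 8-bit Adam.
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-1.7B-Base",          # assumed Hugging Face id of the base model
    torch_dtype=torch.bfloat16,
).cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-5)  # 8-bit optimizer states

accumulation_steps = 32  # micro-batch size 1 x 32 steps = effective batch size 32

for step, batch in enumerate(train_loader):  # train_loader (assumed) yields micro-batches of size 1 with labels
    loss = model(**batch).loss / accumulation_steps  # scale so accumulated gradients average correctly
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```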

Outcome Reward Model

The outcome reward model uses the correctness of the final answer as the reward signal to train the model. More specifically, this is a binary classification problem: the label is 1 if the final answer is correct and 0 otherwise.

Data Stats

We use RLHFlow/Mistral-ORM-Data as the training and evaluation data. The statistics of the data are as follows:

| Dataset Split | Min Length (# of tokens) | Max Length (# of tokens) | Average Length (# of tokens) | Median Length (# of tokens) | Correct Answer Count | Incorrect Answer Count | Total Count |
|---|---|---|---|---|---|---|---|
| Train | 36 | 2007 | 258.64 | 217.0 | 82314 | 189912 | 272226 |
| Val | 75 | 1088 | 254.27 | 211.0 | 139 | 361 | 500 |
| Test | 77 | 1353 | 249.97 | 219.0 | 147 | 353 | 500 |
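For reference, the data can be pulled from the Hugging Face Hub as shown below; the split names and field layout are assumptions and should be checked against the dataset card.

```python
# Quick look at the ORM data (sketch; split names and fields are assumptions).
from datasets import load_dataset

ds = load_dataset("RLHFlow/Mistral-ORM-Data")
print(ds)                          # inspect the available splits
print(next(iter(ds.values()))[0])  # inspect the fields of one example
```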

Training

Following Let's Verify Step by Step, we use a per-token classification loss to train the reward model: every token of the solution is trained to predict the correctness label of the final answer.
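A minimal sketch of that loss, assuming a scalar classification head on top of the LM (the function name and tensor shapes are illustrative, not the repo's code):

```python
# Per-token classification loss for the ORM (sketch): every token of the
# solution is supervised with the same binary answer-correctness label.
import torch
import torch.nn.functional as F

def orm_per_token_loss(logits: torch.Tensor,         # (batch, seq_len) scalar score per token
                       labels: torch.Tensor,         # (batch,) 1 if the final answer is correct, else 0
                       attention_mask: torch.Tensor  # (batch, seq_len) 1 for real tokens, 0 for padding
                       ) -> torch.Tensor:
    token_labels = labels.unsqueeze(1).expand_as(logits).float()  # broadcast the label to every position
    loss = F.binary_cross_entropy_with_logits(logits, token_labels, reduction="none")
    return (loss * attention_mask).sum() / attention_mask.sum()   # average over non-padding tokens
```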

Evaluation

Still following Let's Verify Step by Step, we use the reward model's score at the last token as the score of the whole solution.
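Concretely, scoring at evaluation time might look like the following sketch (assumed shapes and names):

```python
# Score a solution (sketch): sigmoid of the last non-padding token's logit.
import torch

def orm_score(logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    last_idx = attention_mask.long().sum(dim=1) - 1              # index of the last real token
    last_logit = logits.gather(1, last_idx.unsqueeze(1)).squeeze(1)
    return torch.sigmoid(last_logit)                             # probability the answer is correct
```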

Process Reward Model

In contrast to the outcome reward model, the process reward model uses the correctness of each reasoning step as the reward signal. This supervision is more fine-grained and more effective, but the shortcoming is that human labels must be collected for every reasoning step.

Data

We use PRM800K from OpenAI as the training data.

Training

As mentioned in Let's Verify Step by Step, the classification problem is turned into an auto-regressive problem, which is amazing: it effectively reuses the standard language-modeling pipeline to train a supervised model. Again, an excellent idea.

More specifically, a special token is appended at the end of each reasoning step; this token marks where the classification for that step happens. The model's output at that position is a vector of logits over the vocabulary, and the logits of two tokens (+ and -) are used to represent the probability that the step is correct.
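A sketch of that readout, assuming hypothetical token ids for the step separator and the + / - labels:

```python
# Read out step-level correctness from the LM logits (sketch; token ids are assumptions).
import torch

def prm_step_probs(logits: torch.Tensor,     # (seq_len, vocab_size) LM logits for one sample
                   input_ids: torch.Tensor,  # (seq_len,) token ids for the same sample
                   step_sep_id: int,         # id of the special token appended after each step
                   plus_id: int,             # id of the '+' (correct) token
                   minus_id: int             # id of the '-' (incorrect) token
                   ) -> torch.Tensor:
    sep_positions = (input_ids == step_sep_id).nonzero(as_tuple=True)[0]
    two_way = logits[sep_positions][:, [plus_id, minus_id]]   # (num_steps, 2)
    return torch.softmax(two_way, dim=-1)[:, 0]               # P('+') for each reasoning step
```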

Evaluation

The evaluation is pretty much the same as the training.

Bradley-Terry Model

The Bradley-Terry model models the ranking between two items; in our case, the ranking comes from human preference. The reward model predicts a scalar reward for an LLM response: if response A is preferred over response B, the reward of A should be higher than that of B.

Data

We use Anthropic/hh-rlhf as the training/evaluation data.
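The dataset provides a chosen and a rejected conversation per example; a quick way to inspect it (this loading snippet is illustrative, not the repo's code):

```python
# Inspect the preference data: each example has "chosen" and "rejected" conversations.
from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf")
example = hh["train"][0]
print(example["chosen"][:200])    # preferred conversation
print(example["rejected"][:200])  # dispreferred conversation
```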

Training

A chat between the user and the LLM is the input to a pretrained LLM, which outputs a scalar score that we use as the reward of the chat. Each sample contains two chats: one chosen and one rejected, and we train the model so that the score of the chosen chat is higher than that of the rejected one.
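The pairwise objective that encodes this preference is the standard Bradley-Terry / InstructGPT-style loss, sketched below (the repo's exact implementation may differ):

```python
# Pairwise Bradley-Terry loss (sketch): push the chosen reward above the rejected one.
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,    # (batch,) rewards of chosen chats
                       rejected_rewards: torch.Tensor   # (batch,) rewards of rejected chats
                       ) -> torch.Tensor:
    # loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```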

Evaluation

The evaluation is pretty much the same as the training setup.
