This repo contains code for:
- Process Reward Model (PRM): for math problem solving
- Outcome Reward Model (ORM): for math problem solving
- Bradley-Terry Model: for chat alignment
For the PRM and ORM, the implementation is based on the ideas in Let's Verify Step by Step.
For the Bradley-Terry model, the implementation follows the approach in InstructGPT.
I only have a single RTX 3090 with 24 GB, which is very limited memory, so here are a few tricks to run the experiments under this hard constraint (see the training-loop sketch after the list):
- Base model: Qwen3-1.7B-Base, which is a relatively small model
- Use a batch size of 1 with 32 gradient accumulation steps to simulate an effective batch size of 32
- Use an 8-bit Adam optimizer to reduce the optimizer state memory
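A minimal sketch of the training loop under these constraints, using bitsandbytes' 8-bit Adam; `train_loader` and `compute_loss` are hypothetical placeholders, not code from this repo:

```python
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-1.7B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).cuda()

# 8-bit Adam stores the optimizer state quantized, which cuts its memory
# footprint to roughly a quarter of full-precision Adam.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-5)

accum_steps = 32  # batch size 1 x 32 accumulation steps ~= effective batch size 32

model.train()
optimizer.zero_grad(set_to_none=True)
for step, batch in enumerate(train_loader):  # train_loader: hypothetical, yields one sample at a time
    loss = compute_loss(model, batch)        # compute_loss: hypothetical, per-model loss function
    (loss / accum_steps).backward()          # scale so gradients average over the accumulation window
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```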
The outcome reward model uses the correctness of the final answer as the signal to train the model. More specifically, this is a binary classification problem where the label is 1 if the answer is correct and 0 otherwise.
We use RLHFlow/Mistral-ORM-Data as the training and evaluation data. The statistics of the data are as follows (a sketch of how such statistics can be computed follows the table):
| Dataset Split | Min Length (# of tokens) | Max Length (# of tokens) | Average Length (# of tokens) | Median Length (# of tokens) | Correct Answer Count | Incorrect Answer Count | Total Count |
|---|---|---|---|---|---|---|---|
| Train | 36 | 2007 | 258.64 | 217.0 | 82314 | 189912 | 272226 |
| Val | 75 | 1088 | 254.27 | 211.0 | 139 | 361 | 500 |
| Test | 77 | 1353 | 249.97 | 219.0 | 147 | 353 | 500 |
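For reference, a minimal sketch of how such statistics could be computed with the `datasets` library and the Qwen tokenizer; the column names `text` and `label` are assumptions for illustration and may not match the actual schema of RLHFlow/Mistral-ORM-Data:

```python
import statistics
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B-Base")
ds = load_dataset("RLHFlow/Mistral-ORM-Data")

for split in ds:  # iterate over whatever splits the dataset provides
    # "text" (problem + solution string) and "label" (1 = correct, 0 = incorrect)
    # are assumed column names for illustration only.
    lengths = [len(tokenizer(ex["text"]).input_ids) for ex in ds[split]]
    labels = [ex["label"] for ex in ds[split]]
    print(f"{split}: min={min(lengths)} max={max(lengths)} "
          f"mean={sum(lengths) / len(lengths):.2f} median={statistics.median(lengths)} "
          f"correct={sum(labels)} incorrect={len(labels) - sum(labels)} total={len(labels)}")
```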
Following Let's Verify Step by Step, we use a per-token classification loss to train the reward model: the solution-level correctness label is applied at every token position.
Still following the paper, the score at the last token is used as the score for the whole solution. A sketch of both pieces is below.
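A minimal sketch of these two pieces, assuming a scalar value head on top of the base model's hidden states; the class and function names are illustrative, not the repo's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class OutcomeRewardModel(nn.Module):
    """Base LM plus a scalar head that predicts, at every token, whether the final answer is correct."""

    def __init__(self, model_name="Qwen/Qwen3-1.7B-Base"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1, dtype=torch.bfloat16)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden).squeeze(-1)  # per-token logits, shape (batch, seq_len)

def per_token_loss(logits, attention_mask, labels):
    # Broadcast the sequence-level 0/1 label to every non-padding token and
    # average the binary cross-entropy over those tokens.
    targets = labels.unsqueeze(1).expand_as(logits).to(logits.dtype)
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    mask = attention_mask.to(logits.dtype)
    return (loss * mask).sum() / mask.sum()

def last_token_score(logits, attention_mask):
    # The score of the whole solution is the prediction at the last non-padding token.
    last_idx = attention_mask.sum(dim=1) - 1  # (batch,)
    return torch.sigmoid(logits.gather(1, last_idx.unsqueeze(1))).squeeze(1)
```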
As opposed to the outcome reward model, the process reward model uses the correctness of each reasoning step as the training signal. This is more fine-grained and more effective, but the drawback is that human labels must be collected for every reasoning step.
We use PRM800K from OpenAI as the training data.
As mentioned in Let's Verify Step by Step, the classification problem is turned into an auto-regressive problem, which is amazing: it effectively reuses the unsupervised language-modeling pipeline to train a supervised model. Again, an excellent idea.
More specifically, a special token is placed at the end of each reasoning step. This token marks the classification position for that step: the model's output at this position is a vector of vocabulary size, and we use the logits of two tokens (+/-) to represent the probability that the step is correct. A sketch of this scoring scheme is below.
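This sketch assumes that each reasoning step ends with a separator token (here "\n\n") and that "+" and "-" each tokenize to a single token; both are illustrative choices and may differ from the repo's actual setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-1.7B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).cuda()

# Illustrative conventions: every reasoning step ends with a separator token
# (here "\n\n"), and "+" / "-" stand for a correct / incorrect step.
sep_id = tokenizer.encode("\n\n", add_special_tokens=False)[-1]
plus_id = tokenizer.encode("+", add_special_tokens=False)[-1]
minus_id = tokenizer.encode("-", add_special_tokens=False)[-1]

@torch.no_grad()
def step_correctness_probs(solution_text):
    input_ids = tokenizer(solution_text, return_tensors="pt").input_ids.cuda()
    logits = model(input_ids).logits[0]  # (seq_len, vocab_size)
    # The logits at each separator position predict the next token; during
    # training that next token is the "+"/"-" label, so at inference we read
    # only those two entries and softmax over them.
    step_positions = (input_ids[0] == sep_id).nonzero(as_tuple=True)[0]
    two_logits = logits[step_positions][:, [plus_id, minus_id]].float()
    return torch.softmax(two_logits, dim=-1)[:, 0]  # P(step is correct), one per step
```

During training, the cross-entropy loss on these "+"/"-" label tokens falls out of the standard language-modeling objective, which is exactly the trick described above.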
The evaluation is pretty much the same as the training.
The Bradley-Terry model is used to model the ranking between two items. In our case, the ranking is based on human preference: the reward model predicts the reward of an LLM response, and if response A is preferred over response B, then the reward of A should be higher than that of B.
We use Anthropic/hh-rlhf as the training/evaluation data.
A chat between the user and the LLM is fed as input to a pretrained LLM, which outputs a score that we use as the reward of the chat. Each sample contains two chats, one chosen and one rejected, so we expect the score of the chosen chat to be higher than that of the rejected one. A sketch of the pairwise loss is below.
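A minimal sketch of the pairwise objective; `reward_model` is assumed to return one scalar reward per chat (for example the last-token score from a value head, as in the ORM sketch above), which may differ from the repo's exact implementation:

```python
import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import AutoTokenizer

# hh-rlhf samples have "chosen" and "rejected" fields, each a full
# Human/Assistant transcript that ends with the model's response.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B-Base")

def bradley_terry_loss(reward_model, sample):
    chosen = tokenizer(sample["chosen"], return_tensors="pt", truncation=True).to("cuda")
    rejected = tokenizer(sample["rejected"], return_tensors="pt", truncation=True).to("cuda")
    r_chosen = reward_model(**chosen)      # scalar reward for the chosen chat (assumed interface)
    r_rejected = reward_model(**rejected)  # scalar reward for the rejected chat (assumed interface)
    # Bradley-Terry / InstructGPT pairwise loss: -log sigmoid(r_chosen - r_rejected),
    # which pushes the chosen chat's reward above the rejected chat's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```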
Evaluation is pretty much the same as the training setup.